Extreme flooding has become one of the most pressing environmental challenges affecting many countries worldwide. In late 2025, Sri Lanka faced one of the most severe flood emergencies in its recent history. As reported by BBC News on 30 November 2025, the government declared a national state of emergency after widespread floods and mudslides displaced thousands of citizens and caused significant infrastructure damage. The president described the crisis as the “most challenging natural disaster” the country has ever confronted. This event drew international attention and highlighted the growing vulnerability of many regions to climate-related hazards.
Motivated by this real-world context, this project focuses on studying flood risk in Sri Lanka using a rich, multidimensional dataset. The dataset contains 25,000 geo-referenced observations that combine environmental indicators (such as recent rainfall and vegetation indices), terrain characteristics (elevation, proximity to rivers), land-use descriptors, demographic density, infrastructure quality, and historical flood occurrences. Importantly, it provides a continuous flood risk score, which enables quantitative exploration of how geographical, environmental, and socio-economic factors jointly shape exposure to floods.
The originality of this project lies in connecting a current and socially relevant issue with a data-driven risk analysis. Although the dataset is simulated, it is constructed using realistic geospatial and environmental proxies. This makes it suitable for understanding which features are associated with flood-prone areas and for illustrating the usefulness of empirical data in the context of disaster preparedness. By focusing on a case that is both timely and meaningful, the project aims to provide a thoughtful, evidence-based perspective on a problem of growing global significance.
The dataset used in this project comes from the Kaggle repository Sri Lanka Flood Risk and Inundation Dataset, which provides a synthetic but realistic geospatial representation of 25,000 locations across Sri Lanka. Each observation corresponds to a small spatial unit described by environmental, topographic, socio-demographic and infrastructural indicators. The table below summarises the definition of all variables as reported in the official Kaggle documentation.
| Column | Type | Description |
|---|---|---|
| record_id | string | Synthetic unique record id (e.g., F100000). |
| district | string | Administrative district name (Sri Lanka). |
| place_name | string | Synthetic town/village name. |
| latitude | float | WGS84 latitude. |
| longitude | float | WGS84 longitude. |
| elevation_m | integer | Elevation above sea level (meters). |
| distance_to_river_m | float | Distance to nearest river (meters). |
| landcover | string | Land cover type: Urban, Agriculture, Forest, Wetland, etc. |
| soil_type | string | Soil category (Clay, Sandy, Loamy, Peaty, Silty). |
| water_supply | string | Primary water source (Municipal, Well, Surface, etc.). |
| electricity | string | Electricity availability type. |
| road_quality | string | Road access / quality category. |
| population_density_per_km2 | integer | Population density (people per km²). |
| built_up_percent | float | Percent built-up area in the location. |
| urban_rural | string | Urban / Rural classification. |
| rainfall_7d_mm | float | Rainfall accumulated over the last 7 days (mm). |
| monthly_rainfall_mm | float | Monthly accumulated rainfall (mm). |
| drainage_index | float | Proxy index 0–1 (lower values indicate poorer drainage). |
| ndvi | float | Normalized Difference Vegetation Index (−1 to 1). |
| ndwi | float | Normalized Difference Water Index (−1 to 1). |
| water_presence_flag | string | Likely / unlikely surface water presence. |
| historical_flood_count | integer | Number of historical flood events (synthetic). |
| infrastructure_score | integer | Composite infrastructure score (0–100). |
| nearest_hospital_km | float | Distance to nearest hospital (km). |
| nearest_evac_km | float | Distance to nearest evacuation point (km). |
| flood_risk_score | float | Derived flood risk score (0–100) combining features. |
| flood_occurrence_current_event | string | Yes/No flag for the simulated current flood event. |
| inundation_area_sqm | integer | Estimated inundation area in square metres (0 if not flooded). |
| is_good_to_live | string | Yes/No recommendation based on risk and infrastructure. |
| reason_not_good_to_live | string | Explanation when is_good_to_live == No. |
| is_synthetic | boolean | Indicator that the dataset is synthetic. |
| generation_date | date | Generation timestamp (YYYY-MM-DD). |
Note: The dataset can be accessed at the following URL: https://www.kaggle.com/datasets/dewminimnaadi/sri-lanka-flood-risk-and-inundation-dataset.
As established in the preceding sections, the objective of this study
is to examine how environmental, geographical, and socio-demographic
factors influence the degree of flood exposure across Sri Lanka. To this
end, the analysis centres on predicting the continuous variable
flood_risk_score, an index ranging from 0 to 100
that encapsulates the estimated flood risk at each geographic
location.
This score is derived from multiple underlying attributes, including recent rainfall patterns, terrain elevation, proximity to rivers, vegetation indices, population density, and infrastructure characteristics. By modelling this variable, we aim to identify the most influential predictors of flood risk and to quantify their association with the observed spatial variability.
To prepare the dataset for the modelling stage, we first load the cleaned CSV file and then create an 80/20 train–test split.
For that aim, we rely on the initial_split() function
from the tidymodels package and apply stratification on
the response variable (flood_risk_score) to ensure that
both partitions preserve a similar distribution of the target.
# First, we set a seed to ensure reproducibility
set.seed(123)
# Loading the data
srilanka <- read_csv("sri_lanka_flood_risk_dataset_25000.csv")
# Obtaining the stratified 80/20 train–test split
data_split <- initial_split(
srilanka,
prop = 0.8,
strata = flood_risk_score
)
# Obtaining Train/Test Data
TrainData <- training(data_split)
TestData <- testing(data_split)
# We can perform a balance verification
tibble(
Set = c("Train", "Test"),
n = c(nrow(TrainData), nrow(TestData)),
mean_risk = c(mean(TrainData$flood_risk_score),
mean(TestData$flood_risk_score)),
sd_risk = c(sd(TrainData$flood_risk_score),
sd(TestData$flood_risk_score))
)## # A tibble: 2 × 4
## Set n mean_risk sd_risk
## <chr> <int> <dbl> <dbl>
## 1 Train 19998 33.3 14.8
## 2 Test 5002 33.2 14.7
## Rows: 19,998
## Columns: 32
## $ record_id <chr> "F100003", "F100005", "F100012", "F1000…
## $ district <chr> "Puttalam", "Galle", "Mannar", "Matara"…
## $ place_name <chr> "Mahagama East", "Udupola", "Galakanda"…
## $ latitude <dbl> 5.922365, 6.394456, 9.852201, 6.952797,…
## $ longitude <dbl> 81.48479, 80.45167, 80.81168, 80.66979,…
## $ elevation_m <dbl> 111, 34, 112, 153, 1473, 84, 53, 762, 7…
## $ distance_to_river_m <dbl> 2950.4, 2086.5, 631.7, 632.0, 880.1, 14…
## $ landcover <chr> "Urban", "Forest", "Urban", "Forest", "…
## $ soil_type <chr> "Sandy", "Silty", "Clay", "Sandy", "Loa…
## $ water_supply <chr> "Surface water", "Well", "Rainwater har…
## $ electricity <chr> "Grid", "Grid", "Off-grid (solar)", "Gr…
## $ road_quality <chr> "Good (paved)", "Good (paved)", "Fair",…
## $ population_density_per_km2 <dbl> 729, 1730, 261, 10, 1164, 543, 1191, 27…
## $ built_up_percent <dbl> 16.3, 27.6, 5.7, 17.8, 1.0, 2.5, 11.1, …
## $ urban_rural <chr> "Urban", "Urban", "Urban", "Rural", "Ur…
## $ rainfall_7d_mm <dbl> 10.1, 19.5, 50.8, 111.9, 32.8, 66.4, 58…
## $ monthly_rainfall_mm <dbl> 171.6, 93.4, 8.0, 2.9, 139.4, 170.1, 10…
## $ drainage_index <dbl> 0.288, 0.870, 0.376, 0.557, 0.251, 0.62…
## $ ndvi <dbl> 0.439, 0.666, -0.303, 0.732, 0.109, -0.…
## $ ndwi <dbl> -0.259, -0.212, -0.315, 0.343, -0.009, …
## $ water_presence_flag <chr> "Unlikely", "Unlikely", "Unlikely", "Li…
## $ historical_flood_count <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
## $ infrastructure_score <dbl> 58, 45, 56, 27, 41, 76, 36, 30, 43, 33,…
## $ nearest_hospital_km <dbl> 11.44, 9.97, 10.71, 14.14, 6.06, 12.45,…
## $ nearest_evac_km <dbl> 4.94, 0.96, 4.85, 5.05, 7.65, 4.14, 4.3…
## $ flood_risk_score <dbl> 17.44, 12.71, 6.67, 14.05, 18.34, 17.09…
## $ flood_occurrence_current_event <chr> "No", "No", "No", "No", "No", "No", "No…
## $ inundation_area_sqm <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ is_good_to_live <chr> "Yes", "Yes", "Yes", "No", "Yes", "Yes"…
## $ reason_not_good_to_live <chr> NA, NA, NA, "Poor infrastructure", NA, …
## $ is_synthetic <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
## $ generation_date <date> 2024-11-06, 2025-09-19, 2025-04-06, 20…
In order to understand the structure of the data before fitting any model, we first compute basic summary statistics for all variables in the training set. This allows us to obtain an overview of typical values, ranges and potential anomalies for the main features used in the analysis.
In parallel, we inspect the data types of each column to distinguish between continuous variables, categorical factors and boolean indicators, which will determine how these variables are treated in the subsequent linear regression models.
## record_id district place_name latitude
## Length:19998 Length:19998 Length:19998 Min. :5.900
## Class :character Class :character Class :character 1st Qu.:6.914
## Mode :character Mode :character Mode :character Median :7.927
## Mean :7.922
## 3rd Qu.:8.927
## Max. :9.950
## longitude elevation_m distance_to_river_m landcover
## Min. :79.65 Min. : 0 Min. : 0.0 Length:19998
## 1st Qu.:80.21 1st Qu.: 31 1st Qu.: 583.7 Class :character
## Median :80.76 Median : 68 Median : 1407.6 Mode :character
## Mean :80.77 Mean : 187 Mean : 2014.5
## 3rd Qu.:81.32 3rd Qu.: 124 3rd Qu.: 2785.7
## Max. :81.90 Max. :2148 Max. :16702.9
## soil_type water_supply electricity road_quality
## Length:19998 Length:19998 Length:19998 Length:19998
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## population_density_per_km2 built_up_percent urban_rural
## Min. : 10.0 Min. : 1.00 Length:19998
## 1st Qu.: 290.0 1st Qu.: 9.80 Class :character
## Median : 521.5 Median :23.30 Mode :character
## Mean : 627.5 Mean :25.17
## 3rd Qu.: 798.0 3rd Qu.:37.10
## Max. :3113.0 Max. :95.00
## rainfall_7d_mm monthly_rainfall_mm drainage_index ndvi
## Min. : 0.00 Min. : 0.0 Min. :0.0030 Min. :-0.791
## 1st Qu.: 33.10 1st Qu.:106.7 1st Qu.:0.3310 1st Qu.:-0.019
## Median : 69.50 Median :202.0 Median :0.5010 Median : 0.167
## Mean : 80.44 Mean :216.1 Mean :0.5009 Mean : 0.169
## 3rd Qu.:116.80 3rd Qu.:306.8 3rd Qu.:0.6730 3rd Qu.: 0.353
## Max. :372.20 Max. :863.7 Max. :0.9990 Max. : 1.000
## ndwi water_presence_flag historical_flood_count
## Min. :-1.00000 Length:19998 Min. :0.0000
## 1st Qu.:-0.14900 Class :character 1st Qu.:0.0000
## Median : 0.02600 Mode :character Median :0.0000
## Mean : 0.03081 Mean :0.2029
## 3rd Qu.: 0.20600 3rd Qu.:0.0000
## Max. : 1.00000 Max. :5.0000
## infrastructure_score nearest_hospital_km nearest_evac_km flood_risk_score
## Min. : 5.00 Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.:32.00 1st Qu.: 2.250 1st Qu.: 1.730 1st Qu.: 22.54
## Median :46.00 Median : 5.520 Median : 4.160 Median : 32.30
## Mean :46.49 Mean : 7.966 Mean : 6.021 Mean : 33.27
## 3rd Qu.:60.00 3rd Qu.:11.080 3rd Qu.: 8.360 3rd Qu.: 42.94
## Max. :95.00 Max. :75.710 Max. :55.950 Max. :100.00
## flood_occurrence_current_event inundation_area_sqm is_good_to_live
## Length:19998 Min. : 0 Length:19998
## Class :character 1st Qu.: 0 Class :character
## Mode :character Median : 0 Mode :character
## Mean : 12381
## 3rd Qu.: 0
## Max. :309837
## reason_not_good_to_live is_synthetic generation_date
## Length:19998 Mode:logical Min. :2023-12-10
## Class :character TRUE:19998 1st Qu.:2024-06-08
## Mode :character Median :2024-12-10
## Mean :2024-12-08
## 3rd Qu.:2025-06-09
## Max. :2025-12-08
# Variable type table
var_types <- tibble(
Variable = names(TrainData),
Type = sapply(TrainData, function(x) class(x)[1])
)
kable(var_types,
caption = "Variable classes in the training dataset",
align = "ll",
row.names = FALSE) |>
kable_styling(full_width = FALSE,
bootstrap_options = c("striped", "hover"))| Variable | Type |
|---|---|
| record_id | character |
| district | character |
| place_name | character |
| latitude | numeric |
| longitude | numeric |
| elevation_m | numeric |
| distance_to_river_m | numeric |
| landcover | character |
| soil_type | character |
| water_supply | character |
| electricity | character |
| road_quality | character |
| population_density_per_km2 | numeric |
| built_up_percent | numeric |
| urban_rural | character |
| rainfall_7d_mm | numeric |
| monthly_rainfall_mm | numeric |
| drainage_index | numeric |
| ndvi | numeric |
| ndwi | numeric |
| water_presence_flag | character |
| historical_flood_count | numeric |
| infrastructure_score | numeric |
| nearest_hospital_km | numeric |
| nearest_evac_km | numeric |
| flood_risk_score | numeric |
| flood_occurrence_current_event | character |
| inundation_area_sqm | numeric |
| is_good_to_live | character |
| reason_not_good_to_live | character |
| is_synthetic | logical |
| generation_date | Date |
Overall, the training set consists of 19,998 observations and 32 variables. We can extract the following conclusions:
The summary statistics reveal that the main geographical
coordinates latitude and longitude span
relatively narrow ranges, consistently with the spatial extent of Sri
Lanka. Terrain elevation elevation_m varies from low-lying
areas close to sea level up to more than 2,100 metres, while
distance_to_river_m covers a wide interval from locations
directly adjacent to rivers to cells more than 16 km away.
Environmental indicators display substantial variability: recent
rainfall rainfall_7d_mm ranges from 0 to over 370 mm,
monthly rainfall monthly_rainfall_mm reaches almost 870 mm,
and the indices ndvi and ndwi cover most of
their theoretical support [-1,1], suggesting heterogeneous vegetation
and surface-water conditions across the study area.
Socio-demographic and exposure variables also show marked
dispersion. For example, population_density_per_km2 ranges
from sparsely populated cells with only 10 inhabitants per km² to highly
dense locations exceeding 3,100 inhabitants per km², and
built_up_percent spans from almost non-developed areas to
cells with 95% built-up surface.
The variable inundation_area_sqm is highly skewed,
with many zeros and a few very large values above 300,000 m², which
anticipates the need for special treatment if this variable is later
analysed.
The response variable flood_risk_score takes values
between 0 and 100, with a median around 32 and an upper quartile close
to 43, indicating that most locations exhibit low to moderate estimated
risk, while high-risk cells are comparatively less frequent.
The variable–type table clarifies the structure of the dataset.
Most predictors of interest are continuous numeric variables,
which are naturally suited for inclusion in linear regression models
(e.g. elevation_m, distance_to_river_m,
rainfall_7d_mm, monthly_rainfall_mm,
population_density_per_km2,
historical_flood_count, infrastructure_score,
etc.).
A second group comprises categorical variables stored as
character strings, such as district,
landcover, soil_type,
water_supply,electricity,road_quality,etc.
These must be converted to factors before modelling so that the
regression coefficients represent differences between
categories.
Finally, is_synthetic (logical) and
generation_date (date) play a descriptive or metadata role
and will not be used as predictors in the subsequent linear regression
analysis.
flood_risk_scoreBefore exploring relationships with the predictors, it is useful to
examine the marginal distribution of the response variable
flood_risk_score in the training set. This
helps to identify potential skewness, multimodality or extreme values
that may influence the behaviour of linear regression models and the
interpretation of their residuals.
ggplot(TrainData, aes(x = flood_risk_score)) +
geom_density(fill = "skyblue", alpha = 0.6, linewidth = 1) +
labs(
title = "Distribution of the flood risk score",
x = "Flood risk score",
y = "Density"
) +
theme_minimal(base_size = 13)The distribution of flood_risk_score in the training set
exhibits a unimodal and approximately bell-shaped pattern, with most
observations concentrated between 15 and 45. The density curve peaks
around the upper 20s to lower 30s, which is consistent with the median
of 32.3 reported in the summary statistics. The right tail extends
gradually toward higher scores, indicating the presence of a smaller
proportion of locations classified as high-risk (scores above 60).
Conversely, very low values close to zero are relatively uncommon,
suggesting that few areas are assigned negligible flood risk. Overall,
the distribution is only mildly right-skewed and does not display sharp
discontinuities or multimodality, making flood_risk_score
suitable as a continuous response variable for linear modelling.
Given the approximately symmetric and well-behaved distribution of
flood_risk_score, no transformation of the response
variable is required prior to fitting linear regression models. This
decision may be revisited after inspecting model residuals, but the
current distribution does not indicate the need for a logarithmic or
Box–Cox transformation.
To explore the marginal behaviour of the most informative numerical
predictors, we begin by analysing the distributions of three continuous
variables that have a strong theoretical relationship with flood
exposure: elevation_m, distance_to_river_m and
rainfall_7d_mm. These predictors cover complementary
physical dimensions relevant to flood processes:
elevation_m reflects terrain height, which is a
primary determinant of water accumulation. Lower elevations generally
correspond to a higher likelihood of flooding.
distance_to_river_m captures proximity to
hydrological networks. Locations closer to rivers tend to exhibit
greater flood susceptibility due to overflow and channel expansion
during heavy rainfall.
rainfall_7d_mm represents short-term precipitation
intensity. High recent rainfall levels increase soil saturation and
runoff, amplifying flood risk.
By examining these variables individually, we gain insight into their scale, variability and potential skewness, which helps anticipate how they may influence the performance and assumptions of linear regression models.
p1 <- ggplot(TrainData, aes(x = elevation_m)) +
geom_density(fill = "skyblue", alpha = 0.6, linewidth = 1) +
labs(title = "Distribution of elevation",
x = "Elevation (m)", y = "Density") +
theme_minimal(base_size = 13)
p2 <- ggplot(TrainData, aes(x = distance_to_river_m)) +
geom_density(fill = "skyblue", alpha = 0.6, linewidth = 1) +
labs(title = "Distribution of distance to nearest river",
x = "Distance (m)", y = "Density") +
theme_minimal(base_size = 13)
p3 <- ggplot(TrainData, aes(x = rainfall_7d_mm)) +
geom_density(fill = "skyblue", alpha = 0.6, linewidth = 1) +
labs(title = "Distribution of rainfall over the last 7 days",
x = "Rainfall (mm)", y = "Density") +
theme_minimal(base_size = 13)
p1; p2; p3The univariate distributions of the selected continuous predictors reveal markedly different patterns that reflect the underlying physical processes related to flood exposure.
elevation_m is highly right-skewed, with
the vast majority of locations concentrated at low altitudes, which is
consistent with the geographical profile of Sri Lanka.distance_to_river_m decreases sharply near
zero, indicating that many spatial units lie close to river channels, a
factor that may substantially influence flood susceptibility.rainfall_7d_mm exhibits a moderately
right-skewed distribution, with most observations falling below 150 mm
but a non-negligible tail representing intense rainfall events.To complement the analysis of continuous predictors, we now examine
the marginal behaviour of three categorical variables that capture
relevant land-use and socio-geographical characteristics:
landcover, urban_rural and
soil_type. These variables describe, respectively, the
dominant surface cover of each location, whether the area is classified
as urban or rural, and the prevailing soil category.
All three factors may influence flood dynamics through their effects
on infiltration capacity, runoff generation and human exposure. For
example, impermeable urban surfaces or certain soil types can reduce
infiltration and increase surface runoff, whereas vegetated or permeable
areas may attenuate flood peaks. Understanding the distribution of these
categories provides useful context for interpreting their potential
contribution to the variability of flood_risk_score.
p4 <- ggplot(TrainData, aes(x = landcover)) +
geom_bar(fill = "skyblue", alpha = 0.7) +
labs(
title = "Distribution of land-cover categories",
x = "Land-cover type",
y = "Count"
) +
theme_minimal(base_size = 13) +
theme(axis.text.x = element_text(angle = 25, hjust = 1))
p5 <- ggplot(TrainData, aes(x = urban_rural)) +
geom_bar(fill = "skyblue", alpha = 0.7) +
labs(
title = "Distribution of urban vs. rural locations",
x = "Category",
y = "Count"
) +
theme_minimal(base_size = 13)
p6 <- ggplot(TrainData, aes(x = soil_type)) +
geom_bar(fill = "skyblue", alpha = 0.7) +
labs(
title = "Distribution of soil_type categories",
x = "Soil type",
y = "Count"
) +
theme_minimal(base_size = 13)
p4; p5; p6Before fitting the linear regression models, it is necessary to
identify which variables are appropriate to use as predictors of
flood_risk_score. Several attributes in the dataset are
unsuitable due to conceptual, statistical, or methodological
considerations. Below, we explain the rationale for excluding specific
variables.
generation_date: this is a metadata field describing
when each synthetic observation was generated. It contains no
geographical or environmental information relevant to flood risk.is_synthetic: represents a constant attribute equal to
TRUE for all observations. As it lacks variability, it cannot contribute
to model estimation.is_good_to_live and
reason_not_good_to_live: these variables represent derived
recommendations partly based on flood risk itself. Including them would
introduce circularity and bias, as they indirectly encode the target
variable.flood_occurrence_current_event: reflects participation
in a simulated current flood event. As it depends on extreme-event
conditions rather than long-term risk, including it would distort the
modelling of flood_risk_score.inundation_area_sqm: represents the simulated inundated
area for the current event. Its strong dependence on the same mechanisms
that generate the risk score makes it unsuitable due to potential data
leakage.longitude and latitude: although spatial
position is relevant for flood modelling, raw coordinates introduce a
highly non-linear spatial structure that is poorly captured by simple
linear regression. These variables also risk acting as proxies for
unobserved processes, causing unstable and difficult-to-interpret
coefficients.district: contains many categories and would require
generating a large set of dummy variables. Such encoding increases model
complexity, introduces sparse estimations, and may overshadow the effect
of meaningful environmental variables.Moreover, it is important to verify whether any missing values are present among the selected predictors or in the response variable. Detecting and addressing missing data at this stage ensures the integrity of the regression analysis and prevents potential biases or inconsistencies in the model estimates.
# List of selected continuous predictors
vars <- c(
"elevation_m",
"distance_to_river_m",
"rainfall_7d_mm",
"monthly_rainfall_mm",
"drainage_index",
"ndvi",
"ndwi",
"population_density_per_km2",
"built_up_percent",
"infrastructure_score",
"nearest_hospital_km",
"nearest_evac_km",
"historical_flood_count"
)
# Add the response variable to the list
vars_full <- c(vars, "flood_risk_score")
# Subset the dataset to only the variables of interest
subset_df <- TrainData[, vars_full]
# Compute missing values for each selected variable
missing_counts <- sapply(subset_df, function(x) sum(is.na(x)))
# Display the results as a data frame for readability
missing_table <- data.frame(
Variable = names(missing_counts),
Missing_Values = missing_counts
)
missing_table## Variable Missing_Values
## elevation_m elevation_m 0
## distance_to_river_m distance_to_river_m 0
## rainfall_7d_mm rainfall_7d_mm 0
## monthly_rainfall_mm monthly_rainfall_mm 0
## drainage_index drainage_index 0
## ndvi ndvi 0
## ndwi ndwi 0
## population_density_per_km2 population_density_per_km2 0
## built_up_percent built_up_percent 0
## infrastructure_score infrastructure_score 0
## nearest_hospital_km nearest_hospital_km 0
## nearest_evac_km nearest_evac_km 0
## historical_flood_count historical_flood_count 0
## flood_risk_score flood_risk_score 0
To complement the univariate analysis, we now examine how each
predictor relates to the response variable
flood_risk_score. This bivariate exploration is essential
for identifying the strength, direction, and linearity of potential
associations, and for anticipating modelling challenges such as
heteroscedasticity or non-linear patterns.
We organise the analysis into two parts:
Continuous predictors: scatterplots with linear
smoothing lines allow us to visually assess whether variables such as
elevation_m, distance_to_river_m,
rainfall_7d_mm, etc. display monotonic trends with
flood_risk_score.
Categorical predictors: predictors such as
landcover, soil_type,
urban_rural, water_supply, etc. are explored
using boxplots and distribution comparisons, which reveal whether
different categories correspond to systematically higher or lower flood
risk.
We begin with environmental predictors that directly influence flood generation mechanisms.
# Custom theme
custom_theme <- theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold", colour = "#003366"),
axis.title = element_text(colour = "grey20"),
panel.grid.minor = element_blank(),
panel.grid.major = element_line(colour = "grey85")
)
# Lighter points, darker line
pt_col <- "#6baed6" # light blue for points
smooth_col <- "#08306b" # very dark blue for line
p1 <- ggplot(TrainData, aes(x = elevation_m, y = flood_risk_score)) +
geom_point(alpha = 0.20, colour = pt_col) +
geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.4) +
labs(
title = "Elevation vs. flood risk",
x = "Elevation (m)",
y = "Flood risk score"
) +
custom_theme
p2 <- ggplot(TrainData, aes(x = distance_to_river_m, y = flood_risk_score)) +
geom_point(alpha = 0.20, colour = pt_col) +
geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.4) +
labs(
title = "Distance to river vs. flood risk",
x = "Distance to nearest river (m)",
y = "Flood risk score"
) +
custom_theme
p3 <- ggplot(TrainData, aes(x = drainage_index, y = flood_risk_score)) +
geom_point(alpha = 0.20, colour = pt_col) +
geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.4) +
labs(
title = "Drainage index vs. flood risk",
x = "Drainage index (0–1)",
y = "Flood risk score"
) +
custom_theme
(p1 | p2) /
(p3 | patchwork::plot_spacer())Rainfall intensity and vegetation indices reflect hydrometeorological
and land-surface conditions that strongly modulate the buildup,
absorption and runoff of water. Therefore, examining their relationships
with flood_risk_score helps reveal how recent precipitation
and surface characteristics contribute to flood exposure patterns.
# Reusing the custom theme and colors defined before
custom_theme <- theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold", colour = "#003366"),
axis.title = element_text(colour = "grey20"),
panel.grid.minor = element_blank(),
panel.grid.major = element_line(colour = "grey85")
)
pt_col <- "#6baed6" # light blue points
smooth_col <- "#08306b" # dark blue regression line
# Rainfall 7 days
p4 <- ggplot(TrainData, aes(x = rainfall_7d_mm, y = flood_risk_score)) +
geom_point(alpha = 0.20, colour = pt_col) +
geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.4) +
labs(
title = "Rainfall (7d) vs. flood risk",
x = "Rainfall in the last 7 days (mm)",
y = "Flood risk score"
) +
custom_theme
# Monthly Rainfall
p5 <- ggplot(TrainData, aes(x = monthly_rainfall_mm, y = flood_risk_score)) +
geom_point(alpha = 0.20, colour = pt_col) +
geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.4) +
labs(
title = "Monthly rainfall vs. flood risk",
x = "Monthly rainfall (mm)",
y = "Flood risk score"
) +
custom_theme
# NDVI
p6 <- ggplot(TrainData, aes(x = ndvi, y = flood_risk_score)) +
geom_point(alpha = 0.20, colour = pt_col) +
geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.4) +
labs(
title = "NDVI vs. flood risk",
x = "NDVI (−1 to 1)",
y = "Flood risk score"
) +
custom_theme
# NDWI
p7 <- ggplot(TrainData, aes(x = ndwi, y = flood_risk_score)) +
geom_point(alpha = 0.20, colour = pt_col) +
geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.4) +
labs(
title = "NDWI vs. flood risk",
x = "NDWI (−1 to 1)",
y = "Flood risk score"
) +
custom_theme
# Combine into a 2×2 panel
(p4 | p5) /
(p6 | p7)The following panel displays scatterplots for demographic intensity, built-up area, accessibility to emergency services and historical flood counts against the response variable, providing insight into how these socio-environmental factors relate to estimated flood risk.
# Reusing the custom theme and colors defined before
custom_theme <- theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold", colour = "#003366"),
axis.title = element_text(colour = "grey20"),
panel.grid.minor = element_blank(),
panel.grid.major = element_line(colour = "grey85")
)
pt_col <- "#6baed6" # light blue points
smooth_col <- "#08306b" # dark blue regression line
# Population density
p8 <- ggplot(TrainData, aes(x = population_density_per_km2, y = flood_risk_score)) +
geom_point(alpha = 0.20, colour = pt_col) +
geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.2) +
labs(
title = "Population density",
x = "People per km²",
y = "Flood risk score"
) +
custom_theme
# Built-up percent
p9 <- ggplot(TrainData, aes(x = built_up_percent, y = flood_risk_score)) +
geom_point(alpha = 0.20, colour = pt_col) +
geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.2) +
labs(
title = "Built-up area",
x = "Built-up area (%)",
y = "Flood risk score"
) +
custom_theme
# Infrastructure score
p10 <- ggplot(TrainData, aes(x = infrastructure_score, y = flood_risk_score)) +
geom_point(alpha = 0.20, colour = pt_col) +
geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.2) +
labs(
title = "Infrastructure score",
x = "Score (0–100)",
y = "Flood risk score"
) +
custom_theme
# Distance to nearest hospital
p11 <- ggplot(TrainData, aes(x = nearest_hospital_km, y = flood_risk_score)) +
geom_point(alpha = 0.20, colour = pt_col) +
geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.2) +
labs(
title = "Distance to hospital",
x = "km",
y = "Flood risk score"
) +
custom_theme
# Distance to nearest evacuation point
p12 <- ggplot(TrainData, aes(x = nearest_evac_km, y = flood_risk_score)) +
geom_point(alpha = 0.20, colour = pt_col) +
geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.2) +
labs(
title = "Distance to evacuation pt",
x = "km",
y = "Flood risk score"
) +
custom_theme
# Historical flood count
p13 <- ggplot(TrainData, aes(x = historical_flood_count, y = flood_risk_score)) +
geom_point(alpha = 0.20, colour = pt_col) +
geom_smooth(method = "lm", se = FALSE, colour = smooth_col, linewidth = 1.2) +
labs(
title = "Historical flood count",
x = "Number of past floods",
y = "Flood risk score"
) +
custom_theme
# 3×2 layout
(p8 | p9 | p10) /
(p11 | p12 | p13)Categorical predictors describe qualitative characteristics of each
location, such as land use, soil properties or access to basic services.
To investigate how these factors relate to
flood_risk_score, we compare the distribution of the
response across categories of landcover,
soil_type, urban_rural,
water_supply, electricity and
road_quality using boxplots. These visualisations indicate
whether certain classes are systematically associated with higher or
lower estimated flood risk.
custom_theme <- theme_minimal(base_size = 11) +
theme(
plot.title = element_text(face = "bold", colour = "#003366", size = 11),
axis.title = element_text(colour = "grey20"),
axis.text.x = element_text(angle = 25, hjust = 1),
panel.grid.minor = element_blank(),
panel.grid.major = element_line(colour = "grey85")
)
box_col <- "#6baed6"
p_cat1 <- ggplot(TrainData, aes(x = landcover, y = flood_risk_score)) +
geom_boxplot(fill = box_col, alpha = 0.7, outlier.alpha = 0.4) +
labs(title = "Landcover", x = "Land-cover type", y = "Flood risk score") +
custom_theme
p_cat2 <- ggplot(TrainData, aes(x = soil_type, y = flood_risk_score)) +
geom_boxplot(fill = box_col, alpha = 0.7, outlier.alpha = 0.4) +
labs(title = "Soil type", x = "Soil type", y = "Flood risk score") +
custom_theme
p_cat3 <- ggplot(TrainData, aes(x = urban_rural, y = flood_risk_score)) +
geom_boxplot(fill = box_col, alpha = 0.7, outlier.alpha = 0.4) +
labs(title = "Urban vs rural", x = "Category", y = "Flood risk score") +
custom_theme
p_cat4 <- ggplot(TrainData, aes(x = water_supply, y = flood_risk_score)) +
geom_boxplot(fill = box_col, alpha = 0.7, outlier.alpha = 0.4) +
labs(title = "Water supply", x = "Water supply source", y = "Flood risk score") +
custom_theme
p_cat5 <- ggplot(TrainData, aes(x = electricity, y = flood_risk_score)) +
geom_boxplot(fill = box_col, alpha = 0.7, outlier.alpha = 0.4) +
labs(title = "Electricity", x = "Electricity availability", y = "Flood risk score") +
custom_theme
p_cat6 <- ggplot(TrainData, aes(x = road_quality, y = flood_risk_score)) +
geom_boxplot(fill = box_col, alpha = 0.7, outlier.alpha = 0.4) +
labs(title = "Road quality", x = "Road quality", y = "Flood risk score") +
custom_theme
(p_cat1 | p_cat2 | p_cat3) /
(p_cat4 | p_cat5 | p_cat6)The exploratory analysis reveals a diverse set of relationships
between the predictors and the response variable
flood_risk_score. The main findings can be summarised as
follows:
Geographical predictors:
elevation_m and distance_to_river_m
display weak but interpretable negative associations with flood risk.
Lower elevations and locations closer to rivers tend to exhibit higher
flood exposure, consistent with hydrological expectations.Meteorological variables:
rainfall_7d_mm and monthly_rainfall_mm show
clear positive linear trends with the response, indicating that recent
and accumulated precipitation substantially increases estimated flood
risk.
Vegetation indices ndvi and
ndwi exhibit negligible linear relationships, suggesting
that surface-cover characteristics exert only modest influence in the
synthetic risk model.
Socio-environmental predictors:
built_up_percent and
population_density_per_km2 show very weak positive
associations with flood risk.infrastructure_score shows a slight negative
association, consistent with better infrastructure contributing to
reduced vulnerability.nearest_hospital_km and
nearest_evac_km show almost no linear pattern.historical_flood_count stands out with a clearly
increasing trend, making it one of the strongest predictors in the
dataset.Categorical predictors:
Before fitting linear regression models, it is essential to evaluate whether strong linear relationships exist among the selected continuous predictors. High multicollinearity can inflate coefficient variances, reduce model interpretability, and weaken statistical inference.
We restrict the analysis to the continuous predictors selected in the previous subsection:
elevation_mdistance_to_river_mrainfall_7d_mmmonthly_rainfall_mmdrainage_indexndvindwipopulation_density_per_km2built_up_percentinfrastructure_scorenearest_hospital_kmnearest_evac_kmhistorical_flood_count# Step 1: We define the list of continuous predictor names
vars <- c(
"elevation_m",
"distance_to_river_m",
"rainfall_7d_mm",
"monthly_rainfall_mm",
"drainage_index",
"ndvi",
"ndwi",
"population_density_per_km2",
"built_up_percent",
"infrastructure_score",
"nearest_hospital_km",
"nearest_evac_km",
"historical_flood_count"
)
# Step 2: We select these variables from the training dataset
cont_vars <- TrainData |>
dplyr::select(dplyr::all_of(vars))
# Step 3: We compute the Pearson correlation matrix
cor_matrix <- cor(
cont_vars,
use = "pairwise.complete.obs",
method = "pearson"
)
# Print the numeric matrix rounded to 2 decimals to inspect it
# round(cor_matrix, 2)
# Step 4: We reshape the matrix to a long format suitable for ggplot2
cor_df <- as.data.frame(as.table(cor_matrix))
names(cor_df) <- c("Var1", "Var2", "Correlation")
# Step 5: We create a custom heatmap with ggplot
ggplot(cor_df, aes(x = Var1, y = Var2, fill = Correlation)) +
geom_tile() +
geom_text(aes(label = sprintf("%.2f", Correlation)), size = 3) +
scale_fill_gradient2(
low = "#c6dbef",
mid = "white",
high = "#08306b",
midpoint = 0,
limits = c(-1, 1)
) +
coord_equal() +
labs(
title = "Correlation matrix of continuous predictors",
x = NULL,
y = NULL
) +
theme_minimal(base_size = 11) +
theme(
plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank()
)The correlation matrix of the 13 continuous predictors shows that overall multicollinearity is extremely low:
rainfall_7d_mm, monthly_rainfall_mm,
distance_to_river_mand elevation_m could
theoretically show moderate correlations, but in this synthetic dataset
they are almost independent.historical_flood_count are also
weakly related to geographic or meteorological predictors, reflecting
the dataset’s construction.Before specifying the linear regression models, and completing the
work performed in the previous section, it is useful to quantify the
strength of the association between each predictor and the response
variable flood_risk_score.
Evaluating these associations provides an objective foundation for selecting predictors, identifying irrelevant variables, and anticipating potential modelling challenges.
Given that the dataset contains both continuous and categorical predictors, we evaluate their relationships with the response using appropriate measures:
flood_risk_score
explained by category membership.# List of continuous predictors
cont_predictors <- c(
"elevation_m", "distance_to_river_m", "rainfall_7d_mm", "monthly_rainfall_mm",
"drainage_index", "ndvi", "ndwi", "population_density_per_km2",
"built_up_percent", "infrastructure_score",
"nearest_hospital_km", "nearest_evac_km", "historical_flood_count"
)
# List of categorical predictors
cat_predictors <- c("landcover", "soil_type", "urban_rural",
"water_supply", "electricity", "road_quality")
# Continuous predictors: correlations
cont_assoc <- cont_predictors |>
lapply(function(v) {
cor_val <- cor(TrainData[[v]], TrainData$flood_risk_score,
use = "pairwise.complete.obs", method = "pearson")
tibble(Predictor = v, Type = "Continuous", Association = cor_val)
}) |>
bind_rows()
# Categorical predictors: ANOVA effect size (Eta-squared)
cat_assoc <- cat_predictors |>
lapply(function(v) {
# I build the formula flood_risk_score ~ <predictor>
fmla <- as.formula(paste("flood_risk_score ~", v))
model <- aov(fmla, data = TrainData)
eta_sq <- etaSquared(model, anova = TRUE)[1, "eta.sq"]
tibble(Predictor = v, Type = "Categorical", Association = eta_sq)
}) |>
bind_rows()
# Combining and displaying the results
summary_table <- bind_rows(cont_assoc, cat_assoc)
kable(
summary_table,
caption = "Association between predictors and flood risk score",
digits = 3
)| Predictor | Type | Association |
|---|---|---|
| elevation_m | Continuous | -0.112 |
| distance_to_river_m | Continuous | -0.147 |
| rainfall_7d_mm | Continuous | 0.409 |
| monthly_rainfall_mm | Continuous | 0.769 |
| drainage_index | Continuous | -0.146 |
| ndvi | Continuous | -0.013 |
| ndwi | Continuous | 0.002 |
| population_density_per_km2 | Continuous | 0.007 |
| built_up_percent | Continuous | 0.060 |
| infrastructure_score | Continuous | -0.142 |
| nearest_hospital_km | Continuous | -0.002 |
| nearest_evac_km | Continuous | 0.002 |
| historical_flood_count | Continuous | 0.229 |
| landcover | Categorical | 0.000 |
| soil_type | Categorical | 0.000 |
| urban_rural | Categorical | 0.000 |
| water_supply | Categorical | 0.000 |
| electricity | Categorical | 0.001 |
| road_quality | Categorical | 0.003 |
For categorical variables, the η² values are close to zero, indicating that group membership (e.g., land-cover class, soil type, urban/rural status, access to services) explains very little variance in the response variable when considered individually. This suggests that their influence is limited in a marginal (univariate) sense and may become relevant only when combined with other predictors in a multivariate modelling context.
Given the correlation analysis conducted in the previous subsection,
we now refine our focus to those continuous predictors that exhibited
the strongest marginal associations with the response variable
flood_risk_score.
In particular, monthly_rainfall_mm,
rainfall_7d_mm,
historical_flood_count,distance_to_river_m,
drainage_index, infrastructure_score and
elevation_m demonstrated the largest absolute correlations,
indicating a potentially meaningful linear relationship with the flood
risk score when considered individually.
To further quantify the explanatory power of these predictors, we fit a series of simple linear regression models of the form
\[ flood\_risk\_score = \beta_0 + \beta_1 X + \varepsilon , \]
where \(X\) denotes each selected predictor in isolation.
This step serves two purposes:
Provide an interpretable measure of the individual effect of each predictor on the response, free from confounding by other variables.
Establish a baseline for comparison before transitioning to multivariate linear regression models that incorporate multiple predictors simultaneously.
By analysing the estimated coefficients, statistical significance and \(R^2\) values of these simple models, we obtain a clearer picture of which environmental and socio-hydrological factors have the strongest direct influence on flood risk in a univariate setting.
# Simple linear regression: flood_risk_score ~ monthly_rainfall_mm
lm_monthly <- lm(flood_risk_score ~ monthly_rainfall_mm, data = TrainData)
summary(lm_monthly)##
## Call:
## lm(formula = flood_risk_score ~ monthly_rainfall_mm, data = TrainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.211 -7.214 -0.934 5.997 39.308
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.536e+01 1.248e-01 123.1 <2e-16 ***
## monthly_rainfall_mm 8.287e-02 4.873e-04 170.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.47 on 19996 degrees of freedom
## Multiple R-squared: 0.5912, Adjusted R-squared: 0.5912
## F-statistic: 2.892e+04 on 1 and 19996 DF, p-value: < 2.2e-16
# Simple linear regression: flood_risk_score ~ rainfall_7d_mm
lm_rain7d <- lm(flood_risk_score ~ rainfall_7d_mm, data = TrainData)
summary(lm_rain7d)##
## Call:
## lm(formula = flood_risk_score ~ rainfall_7d_mm, data = TrainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.875 -9.942 -0.985 8.752 63.059
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.954723 0.162528 153.5 <2e-16 ***
## rainfall_7d_mm 0.103423 0.001634 63.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.52 on 19996 degrees of freedom
## Multiple R-squared: 0.1669, Adjusted R-squared: 0.1669
## F-statistic: 4006 on 1 and 19996 DF, p-value: < 2.2e-16
# Simple linear regression: flood_risk_score ~ historical_flood_count
lm_hist <- lm(flood_risk_score ~ historical_flood_count, data = TrainData)
summary(lm_hist)##
## Call:
## lm(formula = flood_risk_score ~ historical_flood_count, data = TrainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.532 -10.510 -0.895 9.469 61.145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.9151 0.1099 290.49 <2e-16 ***
## historical_flood_count 6.6972 0.2016 33.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.42 on 19996 degrees of freedom
## Multiple R-squared: 0.05231, Adjusted R-squared: 0.05226
## F-statistic: 1104 on 1 and 19996 DF, p-value: < 2.2e-16
# Simple linear regression: flood_risk_score ~ distance_to_river_m
lm_dist <- lm(flood_risk_score ~ distance_to_river_m, data = TrainData)
summary(lm_dist)##
## Call:
## lm(formula = flood_risk_score ~ distance_to_river_m, data = TrainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.507 -10.631 -0.965 9.544 71.919
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.547e+01 1.470e-01 241.25 <2e-16 ***
## distance_to_river_m -1.090e-03 5.178e-05 -21.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.65 on 19996 degrees of freedom
## Multiple R-squared: 0.02167, Adjusted R-squared: 0.02162
## F-statistic: 442.9 on 1 and 19996 DF, p-value: < 2.2e-16
# Simple linear regression: flood_risk_score ~ drainage_index
lm_drain <- lm(flood_risk_score ~ drainage_index, data = TrainData)
summary(lm_drain)##
## Call:
## lm(formula = flood_risk_score ~ drainage_index, data = TrainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.178 -10.687 -0.956 9.470 67.543
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.1005 0.2541 149.9 <2e-16 ***
## drainage_index -9.6346 0.4632 -20.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.65 on 19996 degrees of freedom
## Multiple R-squared: 0.02118, Adjusted R-squared: 0.02113
## F-statistic: 432.6 on 1 and 19996 DF, p-value: < 2.2e-16
# Simple linear regression: flood_risk_score ~ infrastructure_score
lm_infra <- lm(flood_risk_score ~ infrastructure_score, data = TrainData)
summary(lm_infra)##
## Call:
## lm(formula = flood_risk_score ~ infrastructure_score, data = TrainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.882 -10.657 -0.916 9.548 66.670
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.163818 0.262911 145.16 <2e-16 ***
## infrastructure_score -0.105187 0.005197 -20.24 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.66 on 19996 degrees of freedom
## Multiple R-squared: 0.02007, Adjusted R-squared: 0.02002
## F-statistic: 409.6 on 1 and 19996 DF, p-value: < 2.2e-16
# Simple linear regression: flood_risk_score ~ elevation_m
lm_elev <- lm(flood_risk_score ~ elevation_m, data = TrainData)
summary(lm_elev)##
## Call:
## lm(formula = flood_risk_score ~ elevation_m, data = TrainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.029 -10.706 -0.869 9.639 65.782
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.2338398 0.1202680 284.65 <2e-16 ***
## elevation_m -0.0051327 0.0003222 -15.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.72 on 19996 degrees of freedom
## Multiple R-squared: 0.01253, Adjusted R-squared: 0.01248
## F-statistic: 253.7 on 1 and 19996 DF, p-value: < 2.2e-16
The following conclusions are drawn from the simple linear regression
models fitted using the predictors that showed the strongest marginal
associations with the response variable
flood_risk_score:
monthly_rainfall_mm
rainfall_7d_mm
historical_flood_count
distance_to_river_m
drainage_index
distance_to_river_m
(\(R²\) ≈ 0.021), reinforcing the idea
that infiltration and terrain drainage capacity mitigate floodwater
accumulation.infrastructure_score
elevation_m
Overall, these results show that
monthly_rainfall_mm is by far the strongest marginal
predictor of flood_risk_score, while
rainfall_7d_mm and historical_flood_count
provide more moderate contributions.
We now proceed to construct a series of multiple linear regression models and to evaluate their predictive performance on the hold-out test set.
Comparing the models on both explanatory power and out-of-sample accuracy will guide the selection of a final regression specification for detailed diagnostic analysis.
We start with a full specification that includes all continuous
predictors which exhibited meaningful marginal associations with the
response variable flood_risk_score:
monthly_rainfall_mmrainfall_7d_mmhistorical_flood_countdistance_to_river_mdrainage_indexinfrastructure_scoreelevation_m# Model 1: Full relevant model
lm_full_rel <- lm(
flood_risk_score ~ monthly_rainfall_mm +
rainfall_7d_mm +
historical_flood_count +
distance_to_river_m +
drainage_index +
infrastructure_score +
elevation_m,
data = TrainData
)
# Test-set predictions and performance
pred_full_rel <- predict(lm_full_rel, newdata = TestData)
R2_test_full_rel <- cor(TestData$flood_risk_score, pred_full_rel)^2
RMSE_test_full_rel <- sqrt(
mean((TestData$flood_risk_score - pred_full_rel)^2)
)We next consider a more parsimonious specification that retains only
the 3 strongest predictors identified in the exploratory and simple
regression analyses. This model incorporates the two rainfall variables
monthly_rainfall_mm and rainfall_7d_mm
together with historical_flood_count, capturing the
dominant meteorological drivers and the accumulated vulnerability of
each location. By focusing on these three predictors, we aim to evaluate
whether a simpler formulation can achieve competitive predictive
accuracy while improving interpretability.
# Model 2: Strong rainfall + history
lm_strong3 <- lm(
flood_risk_score ~ monthly_rainfall_mm +
rainfall_7d_mm +
historical_flood_count,
data = TrainData
)
pred_strong3 <- predict(lm_strong3, newdata = TestData)
R2_test_strong3 <- cor(TestData$flood_risk_score, pred_strong3)^2
RMSE_test_strong3 <- sqrt(mean((TestData$flood_risk_score - pred_strong3)^2))We also evaluate a simplified rainfall-only specification, which
includes monthly_rainfall_mm and
rainfall_7d_mm as the sole predictors. This model isolates
the direct meteorological drivers of flood risk to assess how much
predictive power can be achieved without incorporating additional
environmental or structural factors.
# Model 3: Rainfall-only model
lm_rain <- lm(
flood_risk_score ~ monthly_rainfall_mm +
rainfall_7d_mm,
data = TrainData
)
pred_rain <- predict(lm_rain, newdata = TestData)
R2_test_rain <- cor(TestData$flood_risk_score, pred_rain)^2
RMSE_test_rain <- sqrt(mean((TestData$flood_risk_score - pred_rain)^2))We further estimate a hydro-geomorphological model that integrates
key physical determinants of flood generation. In addition to the two
rainfall measures, this specification includes
distance_to_river_m, drainage_index and
elevation_m, allowing us to capture the combined influence
of precipitation, terrain structure, and hydrological context on flood
risk.
# Model 4: Hydro-geomorphological model
lm_hydro <- lm(
flood_risk_score ~ monthly_rainfall_mm +
rainfall_7d_mm +
distance_to_river_m +
drainage_index +
elevation_m,
data = TrainData
)
pred_hydro <- predict(lm_hydro, newdata = TestData)
R2_test_hydro <- cor(TestData$flood_risk_score, pred_hydro)^2
RMSE_test_hydro <- sqrt(mean((TestData$flood_risk_score - pred_hydro)^2))Finally, we estimate a comprehensive model that incorporates all available predictors, including both continuous and categorical variables.
# Model 5: All predictors (continuous + categorical)
lm_all <- lm(
flood_risk_score ~ elevation_m +
distance_to_river_m +
rainfall_7d_mm +
monthly_rainfall_mm +
drainage_index +
ndvi +
ndwi +
population_density_per_km2 +
built_up_percent +
infrastructure_score +
nearest_hospital_km +
nearest_evac_km +
historical_flood_count +
landcover +
soil_type +
urban_rural +
water_supply +
electricity +
road_quality,
data = TrainData
)
# Test-set predictive performance
pred_all <- predict(lm_all, newdata = TestData)
R2_test_all <- cor(TestData$flood_risk_score, pred_all)^2
RMSE_test_all <- sqrt(mean((TestData$flood_risk_score - pred_all)^2))Note: Categorical predictors do not require explicit
encoding when fitting classical multiple linear regression models using
lm(), as R automatically applies appropriate dummy-variable
coding.
We now summarise the predictive performance of all fitted multiple linear regression models by compiling their test-set \(R^2\) and \(RMSE\) values into a single comparison table.
# Collect test-set performance
mlr_perf <- tibble(
Model = c("Full relevant",
"Strong rainfall + history",
"Rainfall only",
"Hydro-geomorphological",
"All predictors"),
R2_test = c(R2_test_full_rel,
R2_test_strong3,
R2_test_rain,
R2_test_hydro,
R2_test_all),
RMSE_test = c(RMSE_test_full_rel,
RMSE_test_strong3,
RMSE_test_rain,
RMSE_test_hydro,
RMSE_test_all)
)
# Print as a clean table (no tibble header)
as.data.frame(mlr_perf)## Model R2_test RMSE_test
## 1 Full relevant 0.8591315 5.506455
## 2 Strong rainfall + history 0.7930970 6.672519
## 3 Rainfall only 0.7455199 7.399476
## 4 Hydro-geomorphological 0.7995043 6.568641
## 5 All predictors 0.8634650 5.421057
The comparison of the alternative multiple linear regression models on the test set leads to the following conclusions:
The “All predictors” model achieves the best predictive performance, with the highest test-set \(R^2 \approx 0.863\) and the lowest RMSE (≈ 5.42). The improvement over the “Full relevant” model is modest but consistent, suggesting that adding the remaining predictors and categorical factors yields a small yet measurable gain in accuracy.
The “Full relevant” model (only continuous predictors with stronger marginal associations) performs almost as well (\(R^2 \approx 0.859\), RMSE ≈ 5.51). This indicates that most of the predictive power is already captured by the key environmental and hydrometeorological variables identified in the exploratory analysis.
Simpler specifications that rely solely on rainfall (“Rainfall only”) or rainfall plus historical events (“Strong rainfall + history”) show clearly lower \(R^2\) (between 0.74 and 0.79) and higher RMSE, confirming that incorporating additional terrain and infrastructural information substantially improves out-of-sample predictions.
The “Hydro-geomorphological” model, which combines rainfall with distance to rivers, drainage and elevation, performs better than the rainfall-only model but remains slightly inferior to the full continuous and all-predictor models. This suggests that, while hydro-geomorphological factors are important, they do not fully substitute the information carried by historical events or socio-environmental variables.
Overall, these results justify selecting the “All predictors” model as the final specification for diagnostic analysis, while noting that the “Full relevant” model offers a competitive and more parsimonious alternative with very similar predictive performance.
We now assess whether the final selected model (“All predictors”) satisfies the classical linear regression assumptions by examining the behaviour of the residuals and identifying potential influential observations.
We begin by checking homoscedasticity, the assumption that the variance of residuals is constant across fitted values. This is inspected using a plot of standardized residuals versus fitted values, where a random cloud of points around zero would indicate no visible pattern.
To assess independence, we inspect the residuals across the observation index. A lack of structure in this plot would suggest that the residuals do not exhibit serial dependence.
par(mfrow = c(1, 2))
resid_std <- rstandard(lm_all)
# Plot 1: Residuals vs Fitted
plot(fitted(lm_all), resid_std,
xlab = "Fitted values",
ylab = "Standardized residuals",
main = "Residuals vs Fitted",
pch = 16, col = "darkred", cex = 0.65)
abline(h = 0, col = "gray40", lty = 2, lwd = 1.3)
# Plot 2: Residuals vs Observation Index
plot(1:length(resid_std), resid_std,
xlab = "Observation Index",
ylab = "Standardized residuals",
main = "Residuals vs Obs. Index",
pch = 16, col = "darkblue", cex = 0.65)
abline(h = 0, col = "gray40", lty = 2, lwd = 1.3)The Residuals vs Fitted plot shows a broadly symmetric cloud around zero with no strong funnel shape, suggesting that variance is approximately constant across fitted values. Minor deviations are present but not severe.
The Residuals vs Observation Index plot does not exhibit visible time-like patterns or clustering, indicating no obvious serial dependence.
To assess whether residuals follow a normal distribution, we use a Q-Q plot. A linear pattern in the Q-Q plot support the assumption of normality.
res <- resid(lm_all)
qqnorm(res, main = "Normal Q–Q Plot of Residuals")
qqline(res, col = "blue", lwd = 2)The Q–Q plot shows a largely linear pattern in the central region, with moderate deviations in the tails, indicating that the residuals follow an approximately normal distribution.
The Scale–Location plot is used to further assess the homoscedasticity assumption by visualising whether the spread of residuals remains constant across fitted values. A roughly horizontal pattern indicates stable variance, while systematic trends suggest heteroscedasticity.
# Compute standardized residuals
resid_std <- rstandard(lm_all)
# Plot: sqrt(|standardized residuals|) vs fitted values
plot(
fitted(lm_all),
sqrt(abs(resid_std)),
xlab = "Fitted values",
ylab = "Sqrt(|Standardized residuals|)",
main = "Scale–Location Plot",
pch = 16,
col = "darkgreen",
cex = 0.65
)
# Add a smooth trend line
lines(
lowess(fitted(lm_all), sqrt(abs(resid_std))),
col = "black",
lwd = 2
)In this case, the Scale–Location plot shows an approximately horizontal trend with no pronounced funnel shape, indicating that the variance of the residuals remains reasonably stable across fitted values. This suggests that the assumption of homoscedasticity is largely satisfied for the final model.
The Residuals vs Leverage plot is used to identify potentially influential observations that may disproportionately affect the fitted regression model. Points with high leverage and large residuals warrant closer inspection.
We also assess influence diagnostics by visualizing standard influence measures for all observations, in order to identify potential high-leverage or influential points that may affect the stability of the fitted model.
The Residuals vs Leverage plot reveals a small number of observations with relatively higher leverage and residual magnitude. However, most points lie well within the Cook’s distance contours, indicating that no single observation exerts undue influence on the fitted regression coefficients.
The influence index plots (Cook’s distance, studentized residuals, hat values) confirm that, although a few observations warrant closer inspection, their overall impact on model stability is limited. The majority of observations exhibit low leverage and low influence.
We first define a simple benchmark (naïve) model that predicts
flood_risk_scoreon the test set using only the mean value
observed in the training set. This provides a baseline to judge whether
our regression models deliver a meaningful improvement.
# First, we compute the benchmark prediction (in this case the mean)
bench_mean <- mean(TrainData$flood_risk_score, na.rm = TRUE)
# Then, we generate constant predictions for every test observation
pred_bench <- rep(bench_mean, nrow(TestData))
# Finally we evaluate benchmark predictive performance on the test set
RMSE_test_bench <- sqrt(mean((TestData$flood_risk_score - pred_bench)^2, na.rm = TRUE))
MAE_test_bench <- mean(abs(TestData$flood_risk_score - pred_bench), na.rm = TRUE)
cat("Benchmark mean (train):", round(bench_mean, 4), "\n")## Benchmark mean (train): 33.2741
## Benchmark RMSE (test): 14.6682
## Benchmark MAE (test): 11.7959
Note: \(R^2\) via
correlation is not defined here because pred_bench is
constant.
lm_all) attains a
substantially lower RMSE (≈ 5.42), indicating a large
improvement in predictive accuracy.In addition to point predictions, prediction intervals quantify the uncertainty around individual forecasts by accounting for both model uncertainty and irreducible noise in the response variable.
We compute 95% prediction intervals for the final selected linear regression model (lm_all) on the test set. These intervals provide a range within which future flood risk scores are expected to fall with a given confidence level.
# Computing prediction intervals on the test set
pred_int <- predict(
lm_all,
newdata = TestData,
interval = "prediction",
level = 0.95
)
# Storing predictions and intervals
TestData$Pred <- pred_int[, "fit"]
TestData$Lower <- pred_int[, "lwr"]
TestData$Upper <- pred_int[, "upr"]To visually assess the quality of the prediction intervals, we plot the observed flood risk scores together with the fitted values and their corresponding prediction bands.
TestData %>%
arrange(Pred) %>%
ggplot(aes(x = Pred)) +
geom_point(aes(y = flood_risk_score),
color = "blue", alpha = 0.5, size = 1.3) +
geom_line(aes(y = Pred),
color = "red", linewidth = 1) +
geom_ribbon(aes(ymin = Lower, ymax = Upper),
fill = "orange", alpha = 0.25) +
labs(
title = "Prediction Intervals vs Observed Flood Risk",
subtitle = "Red line: Predicted values | Blue points: Observed values",
x = "Predicted flood risk score",
y = "Flood risk score"
) +
theme_minimal()We assess the empirical coverage by computing the proportion of test observations that fall within the predicted intervals.
# Counting observations outside the prediction intervals
outside_interval <- sum(
TestData$flood_risk_score < TestData$Lower |
TestData$flood_risk_score > TestData$Upper
)
# Computing coverage
total_obs <- nrow(TestData)
coverage <- round(100 * (1 - outside_interval / total_obs), 1)
cat("Percentage of observations inside the prediction intervals:",
coverage, "%\n")## Percentage of observations inside the prediction intervals: 95.4 %
We now replicate our final linear regression model using the tidymodels framework, which standardises preprocessing (scaling + dummy encoding) and evaluates performance via cross-validation before fitting the final model and testing it on the hold-out set.
set.seed(123)
# 1) Preprocessing recipe (scale numeric + dummy categorical)
risk_recipe <- recipe(flood_risk_score ~ ., data = TrainData) %>%
# Assign ID role to non-predictive variables
update_role(
record_id, district, place_name, latitude, longitude, generation_date,
new_role = "id"
) %>%
# Convert character predictors to factors
step_string2factor(all_nominal_predictors()) %>%
# Dummy encode categorical predictors
step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
# Center and scale numeric predictors
step_normalize(all_numeric_predictors())
# 2) Model specification (linear regression via lm)
lin_spec <- linear_reg() %>%
set_engine("lm")
# 3) Workflow
lin_wf <- workflow() %>%
add_recipe(risk_recipe) %>%
add_model(lin_spec)
# 4) Cross-validation (10-fold, 5 repeats) + metrics
cv_folds <- vfold_cv(TrainData, v = 10, repeats = 5)
cv_fit <- lin_wf %>%
fit_resamples(
resamples = cv_folds,
metrics = metric_set(rmse, rsq, mae),
control = control_resamples(save_pred = TRUE)
)
cv_metrics <- cv_fit %>% collect_metrics()
cv_metrics## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 mae standard 4.35 50 0.0147 pre0_mod0_post0
## 2 rmse standard 5.25 50 0.0193 pre0_mod0_post0
## 3 rsq standard 0.906 50 0.000967 pre0_mod0_post0
# 5) Fit final model on full training set + test-set evaluation
final_fit <- lin_wf %>% fit(data = TrainData)
test_predictions <- TestData %>%
dplyr::select(-any_of("Pred")) %>% # avoid name clashes
dplyr::bind_cols(
predict(final_fit, new_data = TestData) %>%
dplyr::rename(Pred = .pred)
)
# (Optional) sanity check
# names(test_predictions)
test_metrics <- test_predictions %>%
yardstick::metrics(truth = flood_risk_score, estimate = Pred)
test_metrics## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 5.37
## 2 rsq standard 0.897
## 3 mae standard 4.38
# 6) Observed vs Predicted plot (test set)
ggplot(test_predictions, aes(x = flood_risk_score, y = Pred)) +
geom_point(alpha = 0.5,col="blue") +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", col="red") +
labs(
title = "Observed vs Predicted Flood Risk (Test Set)",
x = "Observed flood_risk_score",
y = "Predicted flood_risk_score"
) +
theme_minimal()# 7) Benchmark comparison (mean-only baseline on test set)
bench_mean <- mean(TrainData$flood_risk_score, na.rm = TRUE)
bench_predictions <- TestData %>%
mutate(Pred = bench_mean)
bench_metrics <- bench_predictions %>%
metrics(truth = flood_risk_score, estimate = Pred)
bench_metrics## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 14.7
## 2 rsq standard NA
## 3 mae standard 11.8
# Optional: print side-by-side
list(
Test_Metrics = test_metrics,
Benchmark_Metrics = bench_metrics
)## $Test_Metrics
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 5.37
## 2 rsq standard 0.897
## 3 mae standard 4.38
##
## $Benchmark_Metrics
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 14.7
## 2 rsq standard NA
## 3 mae standard 11.8
Classical linear regression relies on strong assumptions regarding the distribution of the response variable, most notably normality, constant variance, and linearity of the conditional mean. However, many real-world response variables violate these assumptions, particularly when the outcome is binary, categorical, or represents counts or proportions.
Generalized Linear Models (GLMs) extend the classical linear regression framework by allowing the response variable to follow a distribution from the exponential family and by introducing a link function that relates the expected value of the response to a linear predictor. This flexibility enables the modelling of non-Gaussian outcomes while retaining an interpretable regression structure.
In the context of flood risk analysis, certain variables of interest describe discrete events, such as whether a flood occurred or not, rather than continuous magnitudes. For such outcomes, linear regression is not appropriate, as it may produce predictions outside the admissible range and violate distributional assumptions. GLMs, and in particular logistic regression for binary responses, provide a principled framework to model event probabilities while ensuring valid predictions and interpretable effects of explanatory variables.
In the context of Generalized Linear Models, the choice of the response variable is closely tied to the nature of the data and the type of phenomenon under study. Unlike the previous section, where a continuous flood risk index was modelled using linear regression, this part of the analysis focuses on the occurrence of flood events as a binary outcome.
Among the available variables, the response variable selected for the
GLM analysis is
flood_occurrence_current_event, which
indicates whether a flood event occurred at a given location during the
simulated current period. This variable takes two possible values (“Yes”
or “No”), making it naturally suited for a binary response model.
Modelling flood occurrence rather than flood severity allows us to address a different but complementary research question: instead of estimating how severe flood risk is, we aim to quantify the probability that a flood event occurs given environmental, geographical, and socio-demographic conditions. This perspective is particularly relevant for early warning systems and risk classification tasks, where predicting the likelihood of an event is sometimes more important than predicting its magnitude.
In this section, we specify a Generalized Linear Model to analyse the
probability of flood occurrence. Given the binary nature of the response
variable, flood_occurrence_current_event, we adopt a
binomial GLM with a logistic link function.
Let \(Y_i\) denote the indicator of flood occurrence at location \(i\), where \(Y_i = 1\) if a flood occurred and \(Y_i = 0\) otherwise.
The model assumes that
\[
Y_i \sim \text{Bernoulli}(p_i),
\] where \(p_i = P(Y_i = 1 \mid
\mathbf{x}_i)\) is the conditional probability of a flood
event.
The relationship between the predictors and the response is modelled through the logit link function: \[ \log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}. \]
This formulation ensures that predicted probabilities lie in the interval \([0,1]\) and allows the effects of explanatory variables to be interpreted in terms of log-odds. The selected predictors could include hydrological, geographical and socio-environmental variables previously identified.
As we explained previously, in contrast to the previous linear regression analysis, where the response variable was continuous, the Generalized Linear Model focuses on modelling the occurrence of flood events as a binary outcome. For this reason, the data partitioning strategy is adapted accordingly.
The dataset is divided into training and test sets using an 80/20
split, stratified by the binary response variable
flood_occurrence_current_event.
Stratification ensures that the proportion of flood and non-flood observations is preserved across both subsets, which is essential for reliable estimation and evaluation of probabilistic classification models. This approach prevents imbalances in class frequencies that could otherwise bias model fitting and performance assessment.
set.seed(123)
# Importing data
srilanka <- read_csv("sri_lanka_flood_risk_dataset_25000.csv")
# Ensuring that the GLM response is a factor with reference level = "No"
srilanka <- srilanka %>%
mutate(
flood_occurrence_current_event = factor(
flood_occurrence_current_event,
levels = c("No", "Yes")
)
)
# Stratified 80/20 train–test split (stratify by the GLM response)
data_split_glm <- initial_split(
srilanka,
prop = 0.8,
strata = flood_occurrence_current_event
)
TrainData <- training(data_split_glm)
TestData <- testing(data_split_glm)
# Balance verification
tibble(
Set = c("Train", "Test"),
n = c(nrow(TrainData), nrow(TestData)),
event_rate = c(mean(TrainData$flood_occurrence_current_event == "Yes"),
mean(TestData$flood_occurrence_current_event == "Yes"))
)## # A tibble: 2 × 3
## Set n event_rate
## <chr> <int> <dbl>
## 1 Train 20000 0.0973
## 2 Test 5000 0.105
All subsequent exploratory analysis and model estimation for the GLM are conducted on the training set, while the test set is reserved exclusively for out-of-sample evaluation.
Before conducting a detailed exploratory analysis of the predictors,
we first assess the distribution of the response variable in the
training and test sets. As shown in the previous subsection, the binary
response variable flood_occurrence_current_event is
correctly encoded with levels “No” and “Yes”, and the data are
partitioned using a stratified 80/20 train–test split.
The resulting class proportions are well preserved across both subsets, with an event rate of approximately 9.7% in the training set and 10.5% in the test set. This confirms that the stratification procedure has successfully maintained the relative frequency of flood events.
In this subsection, we examine the empirical distribution of the
binary response variable flood_occurrence_current_event in
the training set. This preliminary inspection is essential to understand
the baseline frequency of flood events and to assess the degree of class
imbalance, which directly impacts the interpretation and performance of
probabilistic classification models.
Since logistic regression models the probability of an event occurring, an imbalanced response distribution is common and does not invalidate the analysis. However, documenting the event rate provides important context for model estimation, coefficient interpretation, and subsequent performance evaluation.
# Distribution of the binary response variable in the training set
TrainData %>%
count(flood_occurrence_current_event) %>%
mutate(
proportion = n / sum(n)
)## # A tibble: 2 × 3
## flood_occurrence_current_event n proportion
## <fct> <int> <dbl>
## 1 No 18054 0.903
## 2 Yes 1946 0.0973
# Bar plot of flood occurrence
ggplot(TrainData, aes(x = flood_occurrence_current_event)) +
geom_bar(fill = "skyblue", alpha = 0.7) +
labs(
title = "Distribution of Flood Occurrence in the Training Set",
x = "Flood occurrence (current event)",
y = "Number of observations"
) +
theme_minimal()To better understand the relationship between the explanatory variables and the probability of flood occurrence, we examine how different key predictors are distributed conditional on the binary response variable. Unlike the exploratory analysis conducted for linear regression, the focus here is not on linearity or normality, but on identifying systematic differences in predictor distributions between flood and non-flood observations.
Inspecting predictor behaviour across outcome classes provides valuable insight into their potential discriminative power and helps assess whether higher or lower values of a given variable are associated with an increased likelihood of flood occurrence. This analysis also supports subsequent model specification and interpretation of regression coefficients in the logistic regression framework.
We begin by comparing the distribution of elevation across flood and non-flood observations. Elevation is a fundamental geographic determinant of flood susceptibility, as lower-lying areas are more prone to water accumulation and overflow. Examining elevation conditional on flood occurrence allows us to assess whether flood events are systematically associated with lower altitudes.
Due to the highly right-skewed distribution of elevation and the presence of extreme values, we re-visualise elevation on a logarithmic scale to improve the comparison between flood and non-flood observations.
# Distribution of elevation by flood occurrence (log scale)
ggplot(
TrainData,
aes(x = flood_occurrence_current_event, y = elevation_m)
) +
geom_boxplot(fill = "skyblue", alpha = 0.6) +
scale_y_log10() +
labs(
title = "Elevation by Flood Occurrence (Training Set, Log Scale)",
x = "Flood occurrence (current event)",
y = "Elevation (meters, log scale)"
) +
theme_minimal()When visualized on a logarithmic scale, elevation exhibits a clear systematic difference between flood and non-flood observations. Flood events are associated with lower elevations on average, with the entire distribution for flood occurrences shifted downward relative to non-flood cases. Although substantial overlap remains between the two groups, the observed pattern confirms elevation as a meaningful predictor of flood occurrence.
After analyzing elevation, we now examine the distribution of the distance to the nearest river conditional on flood occurrence. By comparing the distributions of distance to river for flood and non-flood observations, we assess whether flood events tend to occur systematically closer to river networks.
# Distribution of distance to nearest river by flood occurrence
ggplot(
TrainData,
aes(x = flood_occurrence_current_event, y = distance_to_river_m)
) +
geom_boxplot(fill = "skyblue", alpha = 0.6) +
labs(
title = "Distance to Nearest River by Flood Occurrence (Training Set)",
x = "Flood occurrence (current event)",
y = "Distance to nearest river (meters)"
) +
theme_minimal()Due to the strongly right-skewed nature of distances to river and the presence of extreme values, we further examine the relationship between distance to river and flood occurrence by estimating empirical flood rates across distance intervals.
TrainData %>%
mutate(
distance_bin = cut(
distance_to_river_m,
breaks = quantile(distance_to_river_m, probs = seq(0, 1, 0.1), na.rm =
TRUE),
include.lowest = TRUE
)
) %>%
group_by(distance_bin) %>%
summarise(
flood_rate = mean(flood_occurrence_current_event == "Yes"),
n = n()
) %>%
ggplot(aes(x = distance_bin, y = flood_rate)) +
geom_point(color = "skyblue") +
geom_line(group = 1, color = "skyblue") +
labs(
title = "Empirical Flood Probability by Distance to River (Training Set)",
x = "Distance to nearest river (binned)",
y = "Flood occurrence rate"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))The distribution of distance to the nearest river exhibits substantial overlap between flood and non-flood observations, indicating limited marginal separation when considered in isolation. However, the analysis of empirical flood probabilities across distance intervals reveals a clear decreasing trend in flood occurrence as distance to river networks increases. Locations closest to rivers display markedly higher flood probabilities, while the event rate progressively declines for more distant areas.
These findings suggest that proximity to rivers contributes to flood
risk in a probabilistic manner and supports the inclusion of
distance_to_river_m as a relevant explanatory
variable in the subsequent logistic regression models, despite
its limited standalone discriminative power.
After analyzing topographic and hydrological proximity variables, we now examine the distribution of recent rainfall conditional on flood occurrence.
# Distribution of rainfall over the last 7 days by flood occurrence
ggplot(
TrainData,
aes(x = flood_occurrence_current_event, y = rainfall_7d_mm)
) +
geom_boxplot(fill = "skyblue", alpha = 0.6) +
labs(
title = "Rainfall over Last 7 Days by Flood Occurrence (Training Set)",
x = "Flood occurrence (current event)",
y = "Rainfall over last 7 days (mm)"
) +
theme_minimal()The distribution of rainfall accumulated over the previous seven days exhibits a clear separation between flood and non-flood observations. Locations where a flood event occurred show substantially higher median rainfall levels and a generally upward-shifted distribution compared to non-flood locations.
This pattern indicates a strong positive association between recent rainfall intensity and the probability of flood occurrence. While variability and extreme values are present in both groups, flood events are concentrated at markedly higher rainfall levels, supporting the role of short-term precipitation as a key driver of flooding.
These findings provide strong empirical justification for
including rainfall_7d_mm as an explanatory
variable in the subsequent binomial GLM, where its effect can
be formally quantified in terms of changes in flood occurrence
probability.
After analysing short-term precipitation through rainfall accumulated over the previous seven days, we now examine the distribution of monthly accumulated rainfall conditional on flood occurrence.
# Distribution of monthly rainfall by flood occurrence
ggplot(
TrainData,
aes(x = flood_occurrence_current_event, y = monthly_rainfall_mm)
) +
geom_boxplot(fill = "skyblue", alpha = 0.6) +
labs(
title = "Monthly Rainfall by Flood Occurrence (Training Set)",
x = "Flood occurrence (current event)",
y = "Monthly rainfall (mm)"
) +
theme_minimal()The distribution of monthly accumulated rainfall exhibits a clear
separation between flood and non-flood observations. While both groups
display right-skewed distributions with some extreme values, the upward
shift of the entire distribution for flood observations suggests that
monthly rainfall captures longer-term hydrological saturation effects
that complement short-term rainfall indicators. This systematic
difference supports the inclusion of
monthly_rainfall_mm as a key explanatory variable
in the subsequent logistic regression models.
After analysing key continuous predictors, we now examine how different categorical variables are distributed across flood and non-flood observations. For categorical predictors, the goal is not to assess linear trends, but to identify whether certain categories are disproportionately associated with flood occurrence. Such differences provide insight into potential structural or environmental risk factors and inform the interpretation of coefficients in the logistic regression model.
# Distribution of land cover by flood occurrence
ggplot(TrainData, aes(x = landcover, fill = flood_occurrence_current_event)) +
geom_bar(position = "fill", alpha = 0.8) +
scale_y_continuous(labels = scales::percent) +
labs(
title = "Flood Occurrence by Land Cover Type (Training Set)",
x = "Land cover type",
y = "Proportion within land cover",
fill = "Flood occurrence"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 30, hjust = 1))# Distribution of flood occurrence by urban/rural classification
ggplot(TrainData, aes(x = urban_rural, fill = flood_occurrence_current_event)) +
geom_bar(position = "fill", alpha = 0.8) +
scale_y_continuous(labels = scales::percent) +
labs(
title = "Flood Occurrence by Urban–Rural Classification (Training Set)",
x = "Urban / Rural",
y = "Proportion within category",
fill = "Flood occurrence"
) +
theme_minimal()# Distribution of flood occurrence by soil type
ggplot(TrainData, aes(x = soil_type, fill = flood_occurrence_current_event)) +
geom_bar(position = "fill", alpha = 0.8) +
scale_y_continuous(labels = scales::percent) +
labs(
title = "Flood Occurrence by Soil Type (Training Set)",
x = "Soil type",
y = "Proportion within soil type",
fill = "Flood occurrence"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 30, hjust = 1))Regarding land cover, flood events are observed across all categories, with slightly higher proportions in land-cover types associated with reduced infiltration or proximity to water bodies, such as wetlands and urban areas. Although differences are not extreme, the non-uniform distribution suggests that land cover may contribute to flood susceptibility through surface runoff and land-use characteristics.
The urban–rural classification shows a marginally higher proportion of flood occurrences in urban locations compared to rural ones. This pattern is consistent with the presence of impervious surfaces and altered drainage systems in urban environments, which can increase runoff accumulation during heavy rainfall events.
For soil type, flood occurrence rates appear broadly similar across categories, with only mild variation. Nevertheless, soils with lower permeability, such as clay or peaty soils, exhibit slightly higher flood proportions, which aligns with hydrological expectations regarding infiltration capacity.
As a starting point, we fit an intercept-only binomial Generalized Linear Model to estimate the unconditional probability of flood occurrence. This baseline model does not include any explanatory variables and serves as a reference against which more complex models can be compared.
Let \(Y_i\) denote the binary
indicator of flood occurrence at location \(i\), where
\(Y_i = 1\) if a flood event occurred
and \(Y_i = 0\) otherwise. The model
assumes that \[
Y_i \sim \text{Bernoulli}(p),
\] where \(p = P(Y_i = 1)\) is
constant across all observations.
Using a binomial Generalized Linear Model with a logistic link function, the model is specified as \[ \log\left(\frac{p}{1 - p}\right) = \beta_0, \] where \(\beta_0\) is the intercept term. The estimated intercept therefore corresponds to the log-odds of a flood event occurring in the training data.
By applying the inverse logit transformation, the unconditional flood probability is obtained as \[ p = \frac{\exp(\beta_0)}{1 + \exp(\beta_0)}. \]
This baseline model provides a benchmark level of predictive performance and establishes a reference point for evaluating the incremental explanatory power of introducing covariates in subsequent logistic regression models.
# Baseline intercept-only logistic regression
glm_null <- glm(
flood_occurrence_current_event ~ 1,
data = TrainData,
family = binomial(link = "logit")
)
summary(glm_null)##
## Call:
## glm(formula = flood_occurrence_current_event ~ 1, family = binomial(link = "logit"),
## data = TrainData)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.22759 0.02386 -93.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12764 on 19999 degrees of freedom
## Residual deviance: 12764 on 19999 degrees of freedom
## AIC: 12766
##
## Number of Fisher Scoring iterations: 5
## (Intercept)
## 0.0973
The intercept-only logistic regression model estimates an unconditional flood occurrence probability of approximately 9.73%, which coincides with the empirical event rate observed in the training data. This confirms the correct specification of the binomial GLM and provides a baseline level of model fit against which all subsequent models will be compared.
We next consider a series of simple binomial Generalized Linear Models, each including a single explanatory variable. These models allow us to evaluate the individual association between each predictor and the probability of flood occurrence, while maintaining a parsimonious specification.
For each predictor, we estimate a binomial GLM with a logistic link function and compare its fit to the intercept-only baseline model. This stepwise approach provides insight into the marginal explanatory power of individual covariates and facilitates the interpretation of regression coefficients in terms of log-odds and odds ratios.
Let \(Y_i\) denote the binary indicator of flood occurrence at location \(i\), and let \(x_i\) be a single explanatory variable. The model is specified as
\[ Y_i \sim \text{Bernoulli}(p_i), \qquad \log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_i. \]
Here, \(\beta_1\) represents the change in the log-odds of flood occurrence associated with a one-unit increase in the predictor \(x_i\). Positive values of \(\beta_1\) indicate an increased probability of flood occurrence, while negative values indicate a reduced probability.
We begin the sequence of simple logistic regression models by
considering elevation_m as the sole explanatory
variable.
# Simple logistic regression with elevation only
glm_elev <- glm(
flood_occurrence_current_event ~ elevation_m,
data = TrainData,
family = binomial(link = "logit")
)
summary(glm_elev)##
## Call:
## glm(formula = flood_occurrence_current_event ~ elevation_m, family = binomial(link = "logit"),
## data = TrainData)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.131e+00 2.729e-02 -78.113 < 2e-16 ***
## elevation_m -5.783e-04 9.015e-05 -6.415 1.41e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12764 on 19999 degrees of freedom
## Residual deviance: 12717 on 19998 degrees of freedom
## AIC: 12721
##
## Number of Fisher Scoring iterations: 5
## OR 2.5 % 97.5 %
## (Intercept) 0.1186688 0.1124544 0.1251503
## elevation_m 0.9994219 0.9992413 0.9995947
We can extract the following conclusions:
Elevation is a statistically significant predictor of flood occurrence (\(p < 10^{-9}\)), with a negative effect on the probability of flooding.
The estimated odds ratio for elevation_m is 0.9994
(95% CI: [0.99924, 0.99959]), indicating that higher elevations are
associated with lower flood risk.
Although the effect per meter is small, cumulative elevation differences translate into substantial changes in flood probability.
The model improves upon the intercept-only baseline, as evidenced by a reduction in deviance and AIC.
These results are consistent with hydrological theory and confirm elevation as a relevant explanatory variable for flood occurrence.
We now fit a simple logistic regression model using
distance_to_river_m as the sole explanatory variable to
assess its effect on the probability of flood occurrence.
# Simple logistic regression with distance to nearest river
glm_river <- glm(
flood_occurrence_current_event ~ distance_to_river_m,
data = TrainData,
family = binomial(link = "logit")
)
summary(glm_river)##
## Call:
## glm(formula = flood_occurrence_current_event ~ distance_to_river_m,
## family = binomial(link = "logit"), data = TrainData)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.954e+00 3.396e-02 -57.53 <2e-16 ***
## distance_to_river_m -1.527e-04 1.506e-05 -10.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12764 on 19999 degrees of freedom
## Residual deviance: 12644 on 19998 degrees of freedom
## AIC: 12648
##
## Number of Fisher Scoring iterations: 5
## OR 2.5 % 97.5 %
## (Intercept) 0.1417118 0.1325513 0.1514269
## distance_to_river_m 0.9998473 0.9998173 0.9998764
The simple logistic regression model using
distance_to_river_m as the sole predictor shows a strong
and statistically significant association with flood occurrence:
The estimated coefficient for distance_to_river_m is
negative and highly significant (\(p < 2
\times 10^{-16}\)), indicating that flood probability decreases
as distance from the nearest river increases.
The odds ratio is approximately \(0.99985\) per additional meter, implying a small effect at the unit level, but a substantial cumulative effect over hundreds or thousands of meters.
Compared to the intercept-only model, the residual deviance decreases markedly and the AIC is substantially lower, indicating a clear improvement in model fit.
We now consider short-term rainfall (7-day accumulation) and longer-term rainfall (monthly accumulation) as explanatory variables, fitting separate logistic regression models to assess and compare their individual contributions to the probability of flood occurrence.
# Simple logistic regression with rainfall over the last 7 days
glm_rain7 <- glm(
flood_occurrence_current_event ~ rainfall_7d_mm,
data = TrainData,
family = binomial(link = "logit")
)
summary(glm_rain7)##
## Call:
## glm(formula = flood_occurrence_current_event ~ rainfall_7d_mm,
## family = binomial(link = "logit"), data = TrainData)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.6479677 0.0528156 -69.07 <2e-16 ***
## rainfall_7d_mm 0.0141256 0.0003902 36.20 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12764 on 19999 degrees of freedom
## Residual deviance: 11377 on 19998 degrees of freedom
## AIC: 11381
##
## Number of Fisher Scoring iterations: 5
## OR 2.5 % 97.5 %
## (Intercept) 0.026044 0.02346018 0.02885738
## rainfall_7d_mm 1.014226 1.01345347 1.01500484
# Simple logistic regression with monthly rainfall
glm_rain_month <- glm(
flood_occurrence_current_event ~ monthly_rainfall_mm,
data = TrainData,
family = binomial(link = "logit")
)
summary(glm_rain_month)##
## Call:
## glm(formula = flood_occurrence_current_event ~ monthly_rainfall_mm,
## family = binomial(link = "logit"), data = TrainData)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.6863273 0.0711544 -65.86 <2e-16 ***
## monthly_rainfall_mm 0.0087855 0.0001996 44.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12764 on 19999 degrees of freedom
## Residual deviance: 10311 on 19998 degrees of freedom
## AIC: 10315
##
## Number of Fisher Scoring iterations: 6
# Odds ratios and 95% confidence intervals
exp(cbind(
OR = coef(glm_rain_month),
confint(glm_rain_month)
))## OR 2.5 % 97.5 %
## (Intercept) 0.009220489 0.008007769 0.01058419
## monthly_rainfall_mm 1.008824168 1.008432539 1.00922176
According to the results, we can deduce that:
Both 7-day rainfall and monthly rainfall are highly statistically significant predictors of flood occurrence (p < 2e-16), confirming the central role of precipitation in flood generation.
For rainfall over the last 7 days, the estimated odds ratio (≈ 1.014) indicates that each additional millimetre of recent rainfall increases the odds of a flood event by about 1.4%, holding all else constant.
For monthly rainfall, the odds ratio (≈ 1.009) implies a smaller but cumulative effect, where sustained higher rainfall levels increase flood risk through longer-term soil saturation and hydrological stress.
The monthly rainfall model achieves a substantially lower AIC (10315) than the 7-day rainfall model (11381), indicating a better overall model fit when longer-term accumulated rainfall is used as the sole predictor.
As a next step, we extend the simple logistic regression framework to a multivariate setting by jointly including multiple explanatory variables within the same binomial Generalized Linear Model. These multiple logistic regression models allow us to assess the conditional effect of each predictor on the probability of flood occurrence while controlling for the influence of the remaining covariates.
By considering several predictors simultaneously, this approach enables us to evaluate whether variables that are individually significant retain their explanatory power once potential confounding effects are accounted for. In addition, it allows us to examine how the magnitude and statistical significance of regression coefficients change relative to the corresponding univariate models.
Let \(Y_i\) denote the binary indicator of flood occurrence at location \(i\), and let \(\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})\) denote a vector of explanatory variables. The multiple logistic regression model is specified as \[ Y_i \sim \text{Bernoulli}(p_i), \qquad \log\!\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}. \]
In this formulation, each coefficient \(\beta_j\) represents the change in the log-odds of flood occurrence associated with a one-unit increase in predictor \(x_{ij}\), holding all other variables constant. This multivariate perspective provides a more realistic representation of flood risk dynamics, where topographical, hydrological, and meteorological factors act jointly rather than in isolation.
We begin by jointly considering topographical and hydrological proximity factors. Specifically, we include elevation and distance to the nearest river as simultaneous explanatory variables in order to assess their combined and conditional effects on the probability of flood occurrence.
This model allows us to evaluate whether each predictor retains explanatory power after accounting for the other and to examine potential changes in magnitude or significance relative to the corresponding univariate models.
# Multiple logistic regression: elevation + distance to river
glm_geo <- glm(
flood_occurrence_current_event ~ elevation_m + distance_to_river_m,
data = TrainData,
family = binomial(link = "logit")
)
summary(glm_geo)##
## Call:
## glm(formula = flood_occurrence_current_event ~ elevation_m +
## distance_to_river_m, family = binomial(link = "logit"), data = TrainData)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.857e+00 3.652e-02 -50.862 < 2e-16 ***
## elevation_m -5.796e-04 9.040e-05 -6.411 1.45e-10 ***
## distance_to_river_m -1.528e-04 1.507e-05 -10.144 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12764 on 19999 degrees of freedom
## Residual deviance: 12596 on 19997 degrees of freedom
## AIC: 12602
##
## Number of Fisher Scoring iterations: 5
## OR 2.5 % 97.5 %
## (Intercept) 0.1560772 0.1452581 0.1676152
## elevation_m 0.9994206 0.9992395 0.9995939
## distance_to_river_m 0.9998472 0.9998172 0.9998763
According to the results, we can deduce that:
Both elevation and distance to the nearest river remain highly statistically significant predictors of flood occurrence (p < 2e-16) when included jointly in the model, indicating that each variable contributes independent explanatory information.
The estimated odds ratio for elevation_m (≈ 0.99942) confirms a negative association between elevation and flood risk, implying that higher elevations are associated with lower odds of flood occurrence, even after controlling for river proximity.
The odds ratio for distance_to_river_m (≈ 0.99985) indicates that increasing distance from the nearest river reduces flood risk, with a small per-metre effect that accumulates substantially over hundreds or thousands of metres.
Compared to the corresponding single-predictor models, the estimated coefficients remain stable in magnitude and significance, suggesting limited confounding between elevation and river proximity.
The multivariate model achieves a lower AIC (12602) than both univariate models, indicating an improved overall model fit and supporting the joint inclusion of topographical elevation and hydrological proximity as key determinants of flood occurrence.
We now fit a multivariate binomial Generalized Linear Model that jointly incorporates topographic, hydrological proximity, and meteorological predictors. Specifically, elevation, distance to the nearest river, and short-term rainfall accumulation are included as explanatory variables.
This model allows us to evaluate the combined and independent contributions of terrain, river proximity, and recent precipitation to flood occurrence, while controlling for potential confounding effects among these key drivers.
# Multivariate logistic regression:
# elevation + distance to river + short-term rainfall
glm_env_rain <- glm(
flood_occurrence_current_event ~
elevation_m +
distance_to_river_m +
rainfall_7d_mm,
data = TrainData,
family = binomial(link = "logit")
)
summary(glm_env_rain)##
## Call:
## glm(formula = flood_occurrence_current_event ~ elevation_m +
## distance_to_river_m + rainfall_7d_mm, family = binomial(link = "logit"),
## data = TrainData)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.269e+00 5.917e-02 -55.244 < 2e-16 ***
## elevation_m -6.455e-04 9.374e-05 -6.886 5.73e-12 ***
## distance_to_river_m -1.641e-04 1.571e-05 -10.445 < 2e-16 ***
## rainfall_7d_mm 1.432e-02 3.951e-04 36.256 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12764 on 19999 degrees of freedom
## Residual deviance: 11196 on 19996 degrees of freedom
## AIC: 11204
##
## Number of Fisher Scoring iterations: 6
# Odds ratios and 95% confidence intervals
exp(cbind(
OR = coef(glm_env_rain),
confint(glm_env_rain)
))## OR 2.5 % 97.5 %
## (Intercept) 0.03804534 0.03385139 0.04268952
## elevation_m 0.99935472 0.99916728 0.99953469
## distance_to_river_m 0.99983595 0.99980474 0.99986631
## rainfall_7d_mm 1.01442796 1.01364577 1.01521708
According to the results, we can deduce that:
Although elevation, distance to river, and short-term rainfall are all individually and jointly statistically significant predictors, the combined model does not outperform the monthly rainfall–only model in terms of AIC. The substantially lower AIC obtained (10315) for the monthly rainfall model indicates that long-term accumulated precipitation alone captures most of the information relevant for explaining flood occurrence in the data.
This result suggests that monthly rainfall acts as a dominant explanatory variable, effectively summarizing broader hydrological conditions, while the additional topographic and proximity variables provide limited incremental improvement in model fit once monthly precipitation is accounted for.
We now consider a broader set of potential explanatory variables in order to construct more complex logistic regression models. While it is well understood that increasing the number of predictors does not necessarily lead to improved predictive performance, this exploratory step allows us to assess whether additional environmental, socio-demographic, and infrastructural variables provide complementary information beyond the core predictors identified earlier.
Note: The variable flood_risk_score is
not used as a predictor in the GLM analysis because it is a derived
composite index constructed from several environmental and
socio-hydrological variables. Including it would introduce circularity
and compromise the interpretability of the estimated effects on flood
occurrence probability.
We begin by fitting a multivariate binomial Generalized Linear Model that incorporates key environmental and hydrological predictors previously identified as relevant in both the exploratory analysis and the linear regression task. This model includes terrain elevation, proximity to rivers, short and long-term rainfall accumulation, and local drainage conditions.
glm_M1_env <- glm(
flood_occurrence_current_event ~
elevation_m +
distance_to_river_m +
rainfall_7d_mm +
monthly_rainfall_mm +
drainage_index,
data = TrainData,
family = binomial(link = "logit")
)
summary(glm_M1_env)##
## Call:
## glm(formula = flood_occurrence_current_event ~ elevation_m +
## distance_to_river_m + rainfall_7d_mm + monthly_rainfall_mm +
## drainage_index, family = binomial(link = "logit"), data = TrainData)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.164e+00 1.287e-01 -47.908 < 2e-16 ***
## elevation_m -8.296e-04 1.078e-04 -7.699 1.37e-14 ***
## distance_to_river_m -2.115e-04 1.808e-05 -11.700 < 2e-16 ***
## rainfall_7d_mm 1.932e-02 5.016e-04 38.507 < 2e-16 ***
## monthly_rainfall_mm 1.093e-02 2.440e-04 44.781 < 2e-16 ***
## drainage_index -1.309e+00 1.304e-01 -10.039 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12764.4 on 19999 degrees of freedom
## Residual deviance: 8215.8 on 19994 degrees of freedom
## AIC: 8227.8
##
## Number of Fisher Scoring iterations: 7
## OR 2.5 % 97.5 %
## (Intercept) 0.002102986 0.001629891 0.002699191
## elevation_m 0.999170773 0.998955974 0.999378179
## distance_to_river_m 0.999788506 0.999752597 0.999823453
## rainfall_7d_mm 1.019504871 1.018510260 1.020515361
## monthly_rainfall_mm 1.010986620 1.010507903 1.011475030
## drainage_index 0.270128887 0.209078840 0.348568019
We next extend the environmental–hydrological model by incorporating remotely sensed indices related to vegetation cover and surface water presence.
glm_M2_env_veg <- glm(
flood_occurrence_current_event ~
elevation_m +
distance_to_river_m +
rainfall_7d_mm +
monthly_rainfall_mm +
drainage_index +
ndvi +
ndwi,
data = TrainData,
family = binomial(link = "logit")
)
summary(glm_M2_env_veg)##
## Call:
## glm(formula = flood_occurrence_current_event ~ elevation_m +
## distance_to_river_m + rainfall_7d_mm + monthly_rainfall_mm +
## drainage_index + ndvi + ndwi, family = binomial(link = "logit"),
## data = TrainData)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.198e+00 1.305e-01 -47.502 < 2e-16 ***
## elevation_m -8.311e-04 1.078e-04 -7.712 1.24e-14 ***
## distance_to_river_m -2.115e-04 1.808e-05 -11.696 < 2e-16 ***
## rainfall_7d_mm 1.933e-02 5.018e-04 38.511 < 2e-16 ***
## monthly_rainfall_mm 1.094e-02 2.442e-04 44.778 < 2e-16 ***
## drainage_index -1.308e+00 1.304e-01 -10.030 < 2e-16 ***
## ndvi 1.560e-01 1.054e-01 1.480 0.139
## ndwi 1.088e-01 1.093e-01 0.995 0.320
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12764.4 on 19999 degrees of freedom
## Residual deviance: 8212.7 on 19992 degrees of freedom
## AIC: 8228.7
##
## Number of Fisher Scoring iterations: 7
## OR 2.5 % 97.5 %
## (Intercept) 0.002033085 0.001570075 0.002618685
## elevation_m 0.999169229 0.998954380 0.999376683
## distance_to_river_m 0.999788564 0.999752652 0.999823514
## rainfall_7d_mm 1.019513992 1.018519026 1.020524848
## monthly_rainfall_mm 1.010996688 1.010517512 1.011485583
## drainage_index 0.270286264 0.209173042 0.348815566
## ndvi 1.168883223 0.950724454 1.437229007
## ndwi 1.114903219 0.899902166 1.381181925
We now consider a comprehensive mixed-effects logistic regression model that combines continuous environmental, hydrological, and human exposure variables with categorical land-use and spatial classification factors. Land cover type, soil type, and urban–rural classification are included to capture structural differences in flood susceptibility across locations.
glm_M4_full <- glm(
flood_occurrence_current_event ~
elevation_m +
distance_to_river_m +
rainfall_7d_mm +
monthly_rainfall_mm +
drainage_index +
population_density_per_km2 +
built_up_percent +
infrastructure_score +
landcover +
soil_type +
urban_rural,
data = TrainData,
family = binomial(link = "logit")
)
summary(glm_M4_full)##
## Call:
## glm(formula = flood_occurrence_current_event ~ elevation_m +
## distance_to_river_m + rainfall_7d_mm + monthly_rainfall_mm +
## drainage_index + population_density_per_km2 + built_up_percent +
## infrastructure_score + landcover + soil_type + urban_rural,
## family = binomial(link = "logit"), data = TrainData)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.792e+00 1.687e-01 -34.336 < 2e-16 ***
## elevation_m -8.894e-04 1.090e-04 -8.157 3.44e-16 ***
## distance_to_river_m -2.116e-04 1.816e-05 -11.656 < 2e-16 ***
## rainfall_7d_mm 1.951e-02 5.069e-04 38.488 < 2e-16 ***
## monthly_rainfall_mm 1.107e-02 2.475e-04 44.728 < 2e-16 ***
## drainage_index -1.317e+00 1.313e-01 -10.029 < 2e-16 ***
## population_density_per_km2 4.004e-05 8.698e-05 0.460 0.6453
## built_up_percent 3.868e-03 1.655e-03 2.337 0.0195 *
## infrastructure_score -1.328e-02 1.491e-03 -8.907 < 2e-16 ***
## landcoverBare Soil -9.591e-02 1.467e-01 -0.654 0.5132
## landcoverForest -2.431e-02 9.340e-02 -0.260 0.7947
## landcoverPlantation 7.924e-03 9.925e-02 0.080 0.9364
## landcoverScrub -1.723e-02 1.028e-01 -0.168 0.8670
## landcoverUrban -2.033e-01 1.370e-01 -1.485 0.1377
## landcoverWetland 1.978e-01 1.057e-01 1.870 0.0615 .
## soil_typeLoamy -1.487e-01 9.480e-02 -1.568 0.1168
## soil_typePeaty 1.271e-01 9.179e-02 1.384 0.1663
## soil_typeSandy 7.102e-02 9.190e-02 0.773 0.4396
## soil_typeSilty -1.039e-02 9.240e-02 -0.112 0.9105
## urban_ruralUrban 1.922e-01 1.275e-01 1.507 0.1317
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12764.4 on 19999 degrees of freedom
## Residual deviance: 8100.8 on 19980 degrees of freedom
## AIC: 8140.8
##
## Number of Fisher Scoring iterations: 7
## OR 2.5 % 97.5 %
## (Intercept) 0.003052187 0.002187128 0.004237125
## elevation_m 0.999111018 0.998893758 0.999320980
## distance_to_river_m 0.999788369 0.999752299 0.999823472
## rainfall_7d_mm 1.019700391 1.018695334 1.020721722
## monthly_rainfall_mm 1.011131510 1.010645972 1.011627073
## drainage_index 0.267936061 0.206992926 0.346379584
## population_density_per_km2 1.000040040 0.999868912 1.000209934
## built_up_percent 1.003875718 1.000619104 1.007134126
## infrastructure_score 0.986809240 0.983922478 0.989689626
## landcoverBare Soil 0.908541549 0.678023934 1.205517487
## landcoverForest 0.975986779 0.812007714 1.171141313
## landcoverPlantation 1.007955073 0.828762667 1.223078582
## landcoverScrub 0.982919916 0.802311663 1.200834744
## landcoverUrban 0.816018625 0.623834818 1.067247851
## landcoverWetland 1.218667392 0.989080371 1.497293738
## soil_typeLoamy 0.861842590 0.715561908 1.037730358
## soil_typePeaty 1.135476497 0.948613232 1.359543874
## soil_typeSandy 1.073607555 0.896692763 1.285707457
## soil_typeSilty 0.989663649 0.825735363 1.186262440
## urban_ruralUrban 1.211945815 0.943451069 1.555417048
Finally, we explore a targeted interaction model motivated by hydrological theory. Specifically, we include interactions between precipitation intensity and drainage capacity, as well as between short-term rainfall and river proximity. These interactions allow the effect of rainfall on flood occurrence to vary depending on local drainage conditions and hydrological connectivity.
glm_M5_interact <- glm(
flood_occurrence_current_event ~
elevation_m +
distance_to_river_m +
rainfall_7d_mm +
monthly_rainfall_mm +
drainage_index +
rainfall_7d_mm:distance_to_river_m +
monthly_rainfall_mm:drainage_index,
data = TrainData,
family = binomial(link = "logit")
)
summary(glm_M5_interact)##
## Call:
## glm(formula = flood_occurrence_current_event ~ elevation_m +
## distance_to_river_m + rainfall_7d_mm + monthly_rainfall_mm +
## drainage_index + rainfall_7d_mm:distance_to_river_m + monthly_rainfall_mm:drainage_index,
## family = binomial(link = "logit"), data = TrainData)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.950e+00 2.067e-01 -28.789 < 2e-16 ***
## elevation_m -8.305e-04 1.077e-04 -7.709 1.27e-14 ***
## distance_to_river_m -2.275e-04 3.794e-05 -5.998 2.00e-09 ***
## rainfall_7d_mm 1.908e-02 6.815e-04 28.001 < 2e-16 ***
## monthly_rainfall_mm 1.035e-02 5.272e-04 19.631 < 2e-16 ***
## drainage_index -1.706e+00 3.508e-01 -4.863 1.15e-06 ***
## distance_to_river_m:rainfall_7d_mm 1.292e-07 2.700e-07 0.479 0.632
## monthly_rainfall_mm:drainage_index 1.206e-03 9.863e-04 1.222 0.222
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12764.4 on 19999 degrees of freedom
## Residual deviance: 8214.1 on 19992 degrees of freedom
## AIC: 8230.1
##
## Number of Fisher Scoring iterations: 7
## OR 2.5 % 97.5 %
## (Intercept) 0.002606094 0.001730671 0.003891244
## elevation_m 0.999169852 0.998955090 0.999377234
## distance_to_river_m 0.999772493 0.999696856 0.999845552
## rainfall_7d_mm 1.019265388 1.017912582 1.020635772
## monthly_rainfall_mm 1.010404049 1.009369514 1.011458098
## drainage_index 0.181619432 0.091212976 0.360803667
## distance_to_river_m:rainfall_7d_mm 1.000000129 0.999999603 1.000000662
## monthly_rainfall_mm:drainage_index 1.001206397 0.999272484 1.003143930
To facilitate interpretation, we summarise the fitted models in terms of their residual deviance, Akaike Information Criterion (AIC), and the percentage improvement in deviance relative to the null (intercept-only) model. The percentage improvement quantifies how much of the unexplained variability in flood occurrence is reduced by each model compared to the baseline with no predictors.
null_dev <- glm_null$deviance
model_comparison <- tibble(
Model = c(
"Null (Intercept-only)",
"Elevation",
"Distance to river",
"Rainfall (7-day)",
"Rainfall (monthly)",
"Elevation + Distance",
"Elevation + Distance + Rainfall (7-day)",
"Env + Hydro (Elev + Dist + Rain + Drainage)",
"Env + Hydro + Vegetation",
"Comprehensive model (Env + Socio + Categorical)",
"Env + Hydro + Interactions"
),
Predictors = c(
0, 1, 1, 1, 1, 2, 3, 5, 7, 11, 7
),
Residual_Deviance = c(
glm_null$deviance,
glm_elev$deviance,
glm_river$deviance,
glm_rain7$deviance,
glm_rain_month$deviance,
glm_geo$deviance,
glm_env_rain$deviance,
glm_M1_env$deviance,
glm_M2_env_veg$deviance,
glm_M4_full$deviance,
glm_M5_interact$deviance
),
Improvement_Percent = c(
0,
(null_dev - glm_elev$deviance) / null_dev * 100,
(null_dev - glm_river$deviance) / null_dev * 100,
(null_dev - glm_rain7$deviance) / null_dev * 100,
(null_dev - glm_rain_month$deviance) / null_dev * 100,
(null_dev - glm_geo$deviance) / null_dev * 100,
(null_dev - glm_env_rain$deviance) / null_dev * 100,
(null_dev - glm_M1_env$deviance) / null_dev * 100,
(null_dev - glm_M2_env_veg$deviance) / null_dev * 100,
(null_dev - glm_M4_full$deviance) / null_dev * 100,
(null_dev - glm_M5_interact$deviance) / null_dev * 100
),
AIC = c(
AIC(glm_null),
AIC(glm_elev),
AIC(glm_river),
AIC(glm_rain7),
AIC(glm_rain_month),
AIC(glm_geo),
AIC(glm_env_rain),
AIC(glm_M1_env),
AIC(glm_M2_env_veg),
AIC(glm_M4_full),
AIC(glm_M5_interact)
)
)
model_comparison_clean <- model_comparison %>%
mutate(
Residual_Deviance = round(Residual_Deviance, 1),
Improvement_Percent = round(Improvement_Percent, 1),
AIC = round(AIC, 1)
)
kable(
model_comparison_clean,
align = "lcccc"
)| Model | Predictors | Residual_Deviance | Improvement_Percent | AIC |
|---|---|---|---|---|
| Null (Intercept-only) | 0 | 12764.4 | 0.0 | 12766.4 |
| Elevation | 1 | 12716.8 | 0.4 | 12720.8 |
| Distance to river | 1 | 12643.7 | 0.9 | 12647.7 |
| Rainfall (7-day) | 1 | 11377.0 | 10.9 | 11381.0 |
| Rainfall (monthly) | 1 | 10311.2 | 19.2 | 10315.2 |
| Elevation + Distance | 2 | 12596.1 | 1.3 | 12602.1 |
| Elevation + Distance + Rainfall (7-day) | 3 | 11196.1 | 12.3 | 11204.1 |
| Env + Hydro (Elev + Dist + Rain + Drainage) | 5 | 8215.8 | 35.6 | 8227.8 |
| Env + Hydro + Vegetation | 7 | 8212.7 | 35.7 | 8228.7 |
| Comprehensive model (Env + Socio + Categorical) | 11 | 8100.8 | 36.5 | 8140.8 |
| Env + Hydro + Interactions | 7 | 8214.1 | 35.6 | 8230.1 |
Based on the comparative analysis of model fit and complexity summarised in the table, we select two models for further interpretation and predictive assessment.
The first selected model is the Comprehensive model (Environmental + Socio-demographic + Categorical predictors). This specification achieves the lowest residual deviance (8100.8) and the largest improvement over the null model (36.5%), indicating the strongest overall explanatory power. Although it includes a relatively large number of predictors, the corresponding AIC (8140.8) is the lowest among all considered models, suggesting that the gain in goodness of fit more than compensates for the increased model complexity. This model therefore represents the best-performing specification in terms of global fit and information criteria.
As a complementary and more parsimonious alternative, we also select the Environmental + Hydrological model (Elevation, Distance to River, Rainfall, Drainage Index). This model explains a substantial proportion of the null deviance (35.6% improvement) while relying on a considerably smaller set of predictors. Its residual deviance (8215.8) and AIC (8227.8) are only marginally higher than those of more complex specifications, indicating that most of the explanatory power is already captured by key environmental and hydrological drivers.
In this section, we will interpret one of the selected binomial Generalized Linear Model using marginal effects and odds ratios. Rather than focusing solely on raw regression coefficients, we examine how the predicted probability of flood occurrence varies with key explanatory variables, while holding the remaining covariates fixed at representative values.
Based on the model comparison conducted in Section 2.6, we focus on the Environmental + Hydrological model, which includes elevation, distance to the nearest river, short-term rainfall (7-day accumulation), monthly rainfall and drainage conditions. This model achieves a substantial improvement in goodness of fit relative to the null model while maintaining a high level of interpretability and avoiding unnecessary complexity.
# First, we re-define our model
glm_M1_env <- glm(
flood_occurrence_current_event ~
elevation_m +
distance_to_river_m +
rainfall_7d_mm +
monthly_rainfall_mm +
drainage_index,
data = TrainData,
family = binomial(link = "logit")
)
# Elevation
plot(effect("elevation_m", glm_M1_env),
main = "Effect of Elevation on Flood Probability",
ylab = "Predicted Probability of Flood")# Distance to river
plot(effect("distance_to_river_m", glm_M1_env),
main = "Effect of Distance to River on Flood Probability",
ylab = "Predicted Probability of Flood")# 7-day rainfall
plot(effect("rainfall_7d_mm", glm_M1_env),
main = "Effect of 7-Day Rainfall on Flood Probability",
ylab = "Predicted Probability of Flood")# Monthly rainfall
plot(effect("monthly_rainfall_mm", glm_M1_env),
main = "Effect of Monthly Rainfall on Flood Probability",
ylab = "Predicted Probability of Flood")# Drainage index
plot(effect("drainage_index", glm_M1_env),
main = "Effect of Drainage Index on Flood Probability",
ylab = "Predicted Probability of Flood")
According to the plots, we can deduce that:
Elevation shows a clear negative marginal effect: holding all other variables constant, higher elevations are associated with a systematically lower predicted probability of flood occurrence.
Distance to the nearest river also exhibits a strong negative effect, with flood probability decreasing steadily as distance from river networks increases, confirming the importance of hydrological proximity.
Short-term rainfall (7-day accumulation) has a strong positive and nearly linear marginal effect on flood probability, indicating that recent intense precipitation sharply increases the likelihood of flooding.
Monthly rainfall displays a similarly strong positive effect, capturing longer-term hydrological saturation processes that substantially raise flood risk even after controlling for short-term rainfall.
Drainage index has a pronounced negative effect: locations with better drainage capacity (higher index values) are associated with significantly lower predicted flood probabilities.
Overall, the marginal effects plots confirm that all five predictors contribute meaningfully and in the expected directions to flood occurrence risk, with precipitation variables exerting the strongest influence, followed by drainage conditions and topographic–hydrological factors.
In this final section, we evaluate the predictive behaviour of the selected Generalized Linear Models on unseen data. Using the previously defined train–test split, we assess out-of-sample predictive performance and compute prediction intervals for the estimated flood occurrence probabilities.
For each selected model, we obtain predicted probabilities on the test set, construct approximate 95% confidence intervals on the response scale, and evaluate classification performance using a probability threshold. This allows us to assess both predictive uncertainty and practical classification accuracy.
# ========================================
# Model 1: Environmental + Hydrological
# ========================================
# First, we re-define our model
glm_M1_env <- glm(
flood_occurrence_current_event ~
elevation_m +
distance_to_river_m +
rainfall_7d_mm +
monthly_rainfall_mm +
drainage_index,
data = TrainData,
family = binomial(link = "logit")
)
# 1) Predicted probabilities + 95% CI (test set)
# Predict on the linear predictor (log-odds scale)
pred_link_M1 <- predict(
glm_M1_env,
newdata = TestData,
type = "link",
se.fit = TRUE
)
# 95% confidence intervals on the link scale
crit <- 1.96
link_fit <- pred_link_M1$fit
link_lwr <- link_fit - crit * pred_link_M1$se.fit
link_upr <- link_fit + crit * pred_link_M1$se.fit
# Transform back to probability scale
pred_M1 <- data.frame(
lower = glm_M1_env$family$linkinv(link_lwr),
prediction = glm_M1_env$family$linkinv(link_fit),
upper = glm_M1_env$family$linkinv(link_upr)
)
head(pred_M1)## lower prediction upper
## 1 0.047548529 0.053294793 0.059691974
## 2 0.001365132 0.001700045 0.002116949
## 3 0.003248647 0.003863529 0.004594255
## 4 0.275066105 0.296629008 0.319138119
## 5 0.068709125 0.076237559 0.084516021
## 6 0.008825952 0.010313385 0.012048448
# 2) Classification performance (0.5 threshold)
# Predicted probabilities
prob_M1 <- predict(glm_M1_env, TestData, type = "response")
# Class prediction using 0.5 threshold
class_M1 <- ifelse(prob_M1 > 0.5, "Yes", "No")
# Accuracy
acc_M1 <- mean(class_M1 == TestData$flood_occurrence_current_event)
acc_M1## [1] 0.9118
# ====================================================
# Model 2: Comprehensive (Env + Socio + Categorical)
# ====================================================
# First, we re-define our model
glm_M4_full <- glm(
flood_occurrence_current_event ~
elevation_m +
distance_to_river_m +
rainfall_7d_mm +
monthly_rainfall_mm +
drainage_index +
population_density_per_km2 +
built_up_percent +
infrastructure_score +
landcover +
soil_type +
urban_rural,
data = TrainData,
family = binomial(link = "logit")
)
# 1) Predicted probabilities + 95% CI (test set)
pred_link_M4 <- predict(
glm_M4_full,
newdata = TestData,
type = "link",
se.fit = TRUE
)
link_fit <- pred_link_M4$fit
link_lwr <- link_fit - crit * pred_link_M4$se.fit
link_upr <- link_fit + crit * pred_link_M4$se.fit
pred_M4 <- data.frame(
lower = glm_M4_full$family$linkinv(link_lwr),
prediction = glm_M4_full$family$linkinv(link_fit),
upper = glm_M4_full$family$linkinv(link_upr)
)
head(pred_M4)## lower prediction upper
## 1 0.062025633 0.079603292 0.101622647
## 2 0.001387318 0.001909206 0.002626903
## 3 0.002828360 0.003730895 0.004920010
## 4 0.242199582 0.283974977 0.329819833
## 5 0.041691260 0.053237594 0.067755579
## 6 0.008390032 0.010887622 0.014118124
# 2) Classification performance (0.5 threshold)
prob_M4 <- predict(glm_M4_full, TestData, type = "response")
class_M4 <- ifelse(prob_M4 > 0.5, "Yes", "No")
acc_M4 <- mean(class_M4 == TestData$flood_occurrence_current_event)
acc_M4## [1] 0.9114
We can conclude that: